Abstract

We study statistical risk minimization problems under a privacy model in which the data
is kept confidential even from the learner. In this local privacy framework, we establish sharp
upper and lower bounds on the convergence rates of statistical estimation procedures. As a
consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and
the utility, as measured by convergence rate, of any statistical estimator or learning procedure.

1 Introduction



There are natural tensions between learning and privacy that arise whenever a learner must
aggregate data across multiple individuals. The learner wishes to make optimal use of each data
point, but the providers of the data may wish to limit detailed exposure, either to the learner or
to other individuals. It is of great interest to characterize such tensions in the form of
quantitative tradeoffs that can both be part of the public discourse surrounding the design of
systems that learn from data and be employed as controllable degrees of freedom whenever such a
system is deployed.

In this paper we approach this problem from the point of view of statistical decision theory. The
decision-theoretic perspective offers a number of advantages. First, the use of loss functions and
risk functions provides a compelling formal foundation for defining "learning," one that dates back
to Wald [42] in the 1930s, and which has seen continued development in the context of research
on machine learning over the past two decades. Second, by formulating the goals of a learning
system in terms of loss functions, we make it possible for individuals to assess whether the goals
of a learning system align with their own personal utility, and thereby determine the extent to
which they are willing to sacrifice some privacy. Third, an appeal to decision theory permits
abstraction over the details of specific learning procedures, providing (under certain conditions)
minimax lower bounds that apply to any specific procedure. Fourth, the use of loss functions, in
particular convex loss functions, in the design of a learning system allows the powerful tools of
optimization theory to be brought to bear. Not only are optimization-based learning systems often
successful in practice, but they are also often amenable to theoretical analysis. Finally, the
decision-theoretic framework is a probabilistic framework, bringing probabilities to bear in the
transformation from losses to risks, and this provides a natural hook for the use of randomization
to provide control over privacy.

In more formal detail, our framework is as follows. Given a compact convex set
$\Theta \subset \mathbb{R}^d$, we wish to find a parameter value $\theta \in \Theta$ achieving good
average performance under a loss function
$\ell : \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}_+$. Here the value $\ell(X, \theta)$ measures
the performance of the parameter vector $\theta \in \Theta$ on the sample $X \in \mathcal{X}$, and
$\ell(x, \cdot) : \mathbb{R}^d \to \mathbb{R}_+$ is convex for $x \in \mathcal{X}$. We measure the
expected performance of $\theta \in \Theta$ via the risk function

$$\theta \mapsto R(\theta) := \mathbb{E}_P[\ell(X, \theta)], \tag{1}$$

where the expectation is taken over some unknown distribution $P$ over the space $\mathcal{X}$.

In the standard formulation of statistical risk minimization, a method $\mathcal{M}$ is given $n$
samples $X_1, \ldots, X_n$, each drawn independently from $P$, and its goal is to output an estimate
$\hat\theta_n$ that approximately minimizes the risk function $R$. In this paper, instead of
providing the method $\mathcal{M}$ with access to the samples $X_1, \ldots, X_n$, we study the effect
of giving only some disguised view $Z_i$ of each datum $X_i$. With $\hat\theta_n$ now denoting an
estimator based on the perturbed samples $Z_i$, we explicitly quantify the rate of convergence of
$R(\hat\theta_n)$ to $\inf_{\theta \in \Theta} R(\theta)$ as a function of the number of samples $n$
and the amount of privacy provided by $Z_i$.



1.1 Prior work 



There is a long history of research at the intersection of privacy and statistics, going back at
least to the 1960s, when Warner [43] suggested privacy-preserving methods for survey sampling, and
to later work related to census taking and presentation of tabular data (e.g., [19]). More recently,
there has been a large amount of computationally oriented work on privacy. We overview some of the
key ideas in this section, but cannot hope to do justice to the large body of related work,
referring the reader to the comprehensive survey by Dwork [15] and the statistical treatment by
Wasserman and Zhou [44] for background and references.

Most work on privacy attempts to limit disclosure risk: the probability that some adversary
can link a released record to a particular member of the population or identify that someone
belongs to a dataset that generates a statistic [14, 37, 26]. In the statistical literature, work
on disclosure limitation and so-called linkage risk, for example as in the framework of Duncan
and Lambert [13, 14], has yielded several techniques for maintaining privacy, such as aggregation,
swapping features or responses among different data points, or perturbation of data. Other authors
have proposed measures of the utility of released data. The currently standard measure of privacy
is differential privacy, due to Dwork et al. [17], which roughly states that $\hat\theta_n$ must
not depend too much on any single one of the $n$ samples, and it should be difficult to ascertain
whether a vector $x$ belongs to the set $\{X_1, \ldots, X_n\}$ given $\hat\theta_n$. Formally,
paraphrasing the definition of Wasserman and Zhou [44], the method $\mathcal{M}$ has
$\alpha$-differential privacy if



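$$\sup_{S \in \sigma(\Theta)} \; \sup_{x_1, \ldots, x_n} \; \sup_{x_1', \ldots, x_n'}
\frac{\mu(S \mid X_1 = x_1, \ldots, X_n = x_n)}{\mu(S \mid X_1 = x_1', \ldots, X_n = x_n')}
\le \exp(\alpha), \tag{2}$$

where the sets $\{x_1, \ldots, x_n\}$ and $\{x_1', \ldots, x_n'\}$ differ in at most one element,
$\mu(\cdot \mid X_1, \ldots, X_n)$ is (a version of) the conditional probability of the estimator
$\hat\theta$ constructed by the method $\mathcal{M}$ using the $n$ samples, and $\sigma(\Theta)$ is
a suitable $\sigma$-algebra on $\Theta$.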
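To make definition (2) concrete, the following is a minimal sketch (an illustration, not a construction taken from this paper) of the familiar Laplace mechanism for releasing the mean of $n$ values in $[0, 1]$. The function names and the example data are hypothetical. Changing one sample moves the mean by at most $1/n$, so Laplace noise of scale $1/(n\alpha)$ keeps the ratio of output densities below $\exp(\alpha)$.

```python
import math
import random

def laplace_density(t, scale):
    """Density of a centered Laplace distribution with the given scale."""
    return math.exp(-abs(t) / scale) / (2.0 * scale)

def private_mean(xs, alpha, rng):
    """Mean of values in [0, 1] plus Laplace noise of scale 1 / (n * alpha)."""
    n = len(xs)
    scale = 1.0 / (n * alpha)
    # A Laplace variable is the difference of two i.i.d. exponential variables.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(xs) / n + noise

def worst_density_ratio(xs, xs_prime, alpha, grid):
    """Largest ratio of the two output densities over a grid of outputs."""
    n = len(xs)
    scale = 1.0 / (n * alpha)
    m, m_prime = sum(xs) / n, sum(xs_prime) / n
    return max(laplace_density(t - m, scale) / laplace_density(t - m_prime, scale)
               for t in grid)

xs = [0.2, 0.9, 0.4, 1.0]
xs_prime = [0.2, 0.9, 0.4, 0.0]   # differs from xs in exactly one element
alpha = 0.5
grid = [i / 100.0 for i in range(-300, 400)]
ratio = worst_density_ratio(xs, xs_prime, alpha, grid)   # never exceeds exp(alpha)
```

In this example the two datasets differ by the largest amount allowed, so the bound $\exp(\alpha)$ is attained rather than merely respected.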



Differentially private algorithms enjoy many desirable properties [17, 15] and essentially
guarantee that even if an adversary knows all the entries in a dataset but the $n$th, it is
difficult to discern whether a vector $x$ is equal to $X_n$ given the output of the method
$\mathcal{M}$. Several researchers have studied differentially private algorithms for empirical
risk minimization, providing guarantees on the excess risk of differentially private estimators
$\hat\theta$. Chaudhuri et al. use the stability of the output of regularized empirical risk
minimization algorithms to show that by adding Laplace-distributed noise to an empirical estimator
$\hat\theta$ or by adding an additional random term to the empirical risk
$\frac{1}{n} \sum_{i=1}^n \ell(X_i, \theta)$, it is possible to obtain differential privacy and
consistency of $\hat\theta$. Dwork and Lei [16] obtain similar results using robust statistical
estimators, and Smith [43] shows that if one has suitably unbiased estimators, then differential
privacy is possible without compromising asymptotic rates of convergence. Rubinstein et al. [38]
use similar stability and perturbation techniques to demonstrate that it is possible to obtain
differential privacy when solving support vector machine problems, and also show that if the
desired privacy level $\alpha$ in the definition (2) is too small, it is actually impossible to
obtain a parameter $\hat\theta_n$ minimizing the risk $R$.

Our goal is to understand the fundamental tradeoffs between maintaining privacy while still
providing a useful output from the statistical learning procedure $\mathcal{M}$. Though intuitively
there must be some tradeoff, quantifying it precisely has been difficult. As mentioned above,
Rubinstein et al. [38] are able to show that it is impossible to obtain what they call an
$(\epsilon, \delta)$-useful parameter vector $\hat\theta$ that enjoys any differential privacy
guarantees; however, it is unknown whether or not their guarantees might be improvable. Hall et
al. [22] obtain minimax rates of convergence for differentially private histogram estimation,
showing that if a histogram has $d$ bins and we must guarantee $\alpha$-differential privacy (2),
then the minimax risk of the histogram estimator is $d/(n\alpha)$, and Hardt and Talwar [23] give
similar lower bounds on the amount of noise necessary to answer linear database queries. Blum et
al. [3] also give lower bounds on the closeness of certain statistical quantities computed from the
dataset, though their upper and lower bounds do not match. Sankar et al. provide rate-distortion
theorems for utility models involving information-theoretic quantities, which has some similarity
to our risk-based framework, but it appears somewhat challenging to explicitly map their setting
onto ours. With the goal of characterizing what it means to be both useful and private, Ghosh et
al. [20] show that for a one-time computation of counts on a dataset $X_1, \ldots, X_n$ (i.e., the
number of variables satisfying $X_i \in C$ for some set $C$), perturbing the output of a counting
function using geometrically distributed noise is the unique optimal way to guarantee differential
privacy while maximizing a natural notion of utility.

1.2 Our setting 



In contrast to the above work, we study a more local notion of privacy [18, 27], in which each
datum $X_i$ is kept private from the method $\mathcal{M}$. The goal of many types of privacy is to
guarantee that the output $\hat\theta_n$ of the method $\mathcal{M}$ based on the data cannot be
used to discover information about the individual samples $X_1, \ldots, X_n$, but locally private
algorithms only access disguised views of each datum $X_i$. Local algorithms are among the most
classical approaches to privacy, tracing back to work on randomized response in the statistical
literature [43], and rely on communication only of some disguised view $Z_i$ of each true sample
$X_i$.
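Randomized response is the simplest instance of such a disguised view. The sketch below is an illustration under stated assumptions (binary data, a truth probability `p_truth` of 0.75, and hypothetical function names): each respondent reports the true bit with probability `p_truth` and its flip otherwise, and the statistician inverts the known bias to estimate the population proportion without seeing any true answer.

```python
import random

def randomized_response(x, p_truth, rng):
    """Report the true bit x with probability p_truth, else its flip."""
    return x if rng.random() < p_truth else 1 - x

def debiased_proportion(zs, p_truth):
    """Unbiased estimate of the proportion of ones from the disguised bits.

    E[Z] = (1 - p_truth) + pi * (2 * p_truth - 1), which is solved for pi.
    """
    z_bar = sum(zs) / len(zs)
    return (z_bar - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)

rng = random.Random(0)
true_bits = [1 if rng.random() < 0.3 else 0 for _ in range(20000)]
disguised = [randomized_response(x, 0.75, rng) for x in true_bits]
estimate = debiased_proportion(disguised, 0.75)
```

Moving `p_truth` toward 1/2 hides each answer better but inflates the variance of the estimate through the factor $1/(2p - 1)$, a small-scale version of the tradeoff studied here.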

Locally private algorithms are natural when the providers of the data (the population sampled
to give $X_1, \ldots, X_n$) do not even trust the statistician or statistical method
$\mathcal{M}$, but the providers are interested in the parameter vector $\theta^*$ that minimizes
the risk function. For example, in medical applications, a participant may be embarrassed about his
use of drugs, or perhaps about his marital status, but if the loss $\ell$ is able to measure the
likelihood of developing cancer, then the participant has high utility for access to the optimal
parameters $\theta^*$. Internet applications, where a user's activity is logged across multiple
websites or searches, provide another example: the user has a utility for a search engine to have a
ranking function $\hat\theta$ that returns relevant results for web searches, yet may not wish to
reveal his or her search data. In essence, we would like the statistical procedure $\mathcal{M}$ to
learn from the data $X_1, \ldots, X_n$ but not about it.

The work most related to ours seems to be that of Kasiviswanathan et al. [27], who show
that (in some settings) locally private algorithms coincide with concepts that can be learned
with polynomial sample complexity in Kearns's statistical query (SQ) model [28]. This result is
powerful, but has some limitations, as the statistical query model relies exclusively on count
queries. In contrast, our analysis applies to estimators deriving from a broad class of convex
risks (1) and it provides sharp rates of convergence.

We develop our approach to local privacy in the setting of two related privacy measures. The
first is a worst-case measure of mutual information, where we view privacy preservation as a game
between the providers of the data, who wish to preserve privacy, and nature. The second is
based on differential privacy, where the provider of each datum communicates (subject to some
constraints we make explicit later) the most differentially private view $Z_i$ of his or her datum
$X_i$.

Turning first to the information-theoretic formulation, and recalling that the method
$\mathcal{M}$ sees only the perturbed version $Z_i$ of $X_i$, we use a uniform variant of mutual
information $I(Z_i; X_i)$ between the random variables $X_i$ and $Z_i$ as our measure for privacy.
Using mutual information and related information-theoretic ideas in the privacy and security
context is by no means original; see, for example, the survey [31]. It is important to note,
however, that standard mutual information has deficiencies as a measure of privacy (e.g., [18]).
Accordingly, our uniform notion of mutual information is as follows: we say that the distribution
$Q$ generating $Z$ from $X$ is private only if $I(X; Z)$ is small for all possible distributions
$P$ on $X$, possibly subject to some constraints.

In this setting, we design procedures that allow consistent estimation of the parameter
$\theta^*$ minimizing $R(\theta) = \mathbb{E}_P[\ell(X, \theta)]$, for any convex loss $\ell$ and
distribution $P$ on the data $X$. One central consequence of our analysis is a sharp
characterization of the excess risk

$$\Delta_n(\hat\theta; \ell, \Theta) := \mathbb{E}\big[R(\hat\theta)\big]
- \inf_{\theta \in \Theta} R(\theta) \tag{3}$$

associated with any estimator $\hat\theta$ that satisfies a pre-specified privacy constraint. For
particular collections $\mathcal{L}$ of loss functions $\ell \in \mathcal{L}$ we bound the minimax
convergence rate of all estimation procedures. More precisely, if one wishes to guarantee a level
of privacy $I(X_i; Z_i) \le I^*$, then we show that there exists a constant
$a(\mathcal{L}, \Theta) \in \mathbb{R}_+$, dependent only on the properties of the collection
$\mathcal{L}$ and domain $\Theta$, such that for any estimator $\hat\theta$ for the family
$\mathcal{L}$, the excess risk is lower bounded as

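$$\sup_{\ell \in \mathcal{L}} \Delta_n(\hat\theta; \ell, \Theta)
\ge \frac{a(\mathcal{L}, \Theta)}{\sqrt{n I^*}}. \tag{4a}$$

Moreover, we also prove that there exists another constant
$b(\mathcal{L}, \Theta) \ge a(\mathcal{L}, \Theta)$ and provide explicit estimators $\hat\theta$
with privacy guarantee $I^*$ such that

$$\sup_{\ell \in \mathcal{L}} \Delta_n(\hat\theta; \ell, \Theta)
\le \frac{b(\mathcal{L}, \Theta)}{\sqrt{n I^*}}. \tag{4b}$$

Turning to the setting of differential privacy, we are able to show similar results to the bounds
(4a) and (4b). Namely, there exist constants $b'(\mathcal{L}, \Theta) \ge a'(\mathcal{L}, \Theta)$
such that if we wish to guarantee $\alpha$-differential privacy, then for any estimator
$\hat\theta$, the risk is lower bounded by

$$\sup_{\ell \in \mathcal{L}} \Delta_n(\hat\theta; \ell, \Theta)
\ge \frac{a'(\mathcal{L}, \Theta)}{\sqrt{n \alpha^2}}, \tag{5a}$$

while there exist estimators $\hat\theta$ such that

$$\sup_{\ell \in \mathcal{L}} \Delta_n(\hat\theta; \ell, \Theta)
\le \frac{b'(\mathcal{L}, \Theta)}{\sqrt{n \alpha^2}}. \tag{5b}$$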



Finally, we show that stochastic gradient descent is one procedure that achieves the above upper
bounds, and moreover, that the ratios $b(\mathcal{L}, \Theta)/a(\mathcal{L}, \Theta)$ and
$b'(\mathcal{L}, \Theta)/a'(\mathcal{L}, \Theta)$ are bounded above by a universal (numerical)
constant. The bounds (4) and (5) thus establish and quantify explicitly the sharp tradeoff between
learning and statistical estimation, on the one hand, and the amount of privacy provided to the
population, on the other. Moreover, the algorithms we use to give the upper bounds apply in
streaming and online settings, requiring only a fixed-size memory footprint.

Our subsequent analysis will build on this favorable property of gradient-based methods. Indeed,
in the remainder of the paper, we will assume that the communication protocol by which data is
conveyed to the learner $\mathcal{M}$ is based on (sub)gradients of the loss. As further motivation
for this choice, note first that the subgradient (more generally, a score function) of the loss
$\ell$ is asymptotically sufficient in the sense of Le Cam. A bit more precisely, gradients (in an
asymptotic sense) contain all of the statistical information for risk minimization problems.
Second, estimation procedures based on stochastic gradient information are asymptotically
efficient [36], in the sense of both Bahadur and minimax efficiency [41, Chapter 8], and are thus
essentially sample optimal; they also have minimax-optimality guarantees in finite-sample settings.
Moreover, many (perhaps most) estimation procedures are gradient-based [33, 6], and distributed
optimization procedures that send subgradient information across a network to a centralized
procedure $\mathcal{M}$ are natural. Our arguments will also show that disguising subgradients is
(in many settings) equivalent to disguising the data $X$ itself.



1.3 Outline and techniques 

We spend the remainder of the paper deriving the bounds displayed in (4) and (5). Our route
to obtaining these bounds is based on a two-part analysis. First, we consider saddle points of
the mutual information $I(X; Z)$, when viewed as a function of the distribution $P$ of $X$ and the
conditional distribution $Q(\cdot \mid X)$ of $Z$, under natural constraints that still allow
estimation. We consider related saddle points for differentially private conditional distributions.
Having computed these saddle points, we can apply information-theoretic techniques for obtaining
lower bounds on estimation to prove the results of the form (4a) or (5a). Our upper bounds then
follow by application of known convergence rates for computationally efficient methods, such as the
stochastic gradient and mirror descent algorithms [33].

The remainder of the paper is organized as follows. We give a precise definition of our notions
of local privacy in Section 2. Section 3 is devoted to information-theoretic lower bounds on the
convergence rate of any statistical method $\mathcal{M}$ in terms of the mutual information $I^*$
between what the method $\mathcal{M}$ observes and each sample $X_i$. We characterize the unique
privacy guaranteeing distributions in Section 4, which provides a constructive mechanism for
trading off privacy and learning. We devote Section 5 to the proofs of results given in Section 3,
with our more technical results deferred to the appendices. We present our conclusions in
Section 6.



Notation. Before continuing, we give our notation and a few standard definitions. The
Kullback-Leibler (KL) divergence between distributions $P$ and $Q$ defined on a set $S$, where $P$
and $Q$ are assumed to have densities $p$ and $q$ with respect to a base measure $\mu$,$^1$ is
given by

$$D_{\mathrm{kl}}(P \,\|\, Q) := \int_S p(s) \log \frac{p(s)}{q(s)} \, d\mu(s).$$

Similarly, the total-variation distance between the distributions $P$ and $Q$ is defined as

$$\|P - Q\|_{\mathrm{TV}} := \sup_{A \subset S} |P(A) - Q(A)|
= \frac{1}{2} \int_S |p(s) - q(s)| \, d\mu(s).$$

For a convex function $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$, the subgradient set
$\partial f(\theta)$ of $f$ at the point $\theta$ is

$$\partial f(\theta) := \big\{ g \in \mathbb{R}^d : f(\theta') \ge f(\theta)
+ \langle g, \theta' - \theta \rangle \text{ for all } \theta' \in \mathbb{R}^d \big\}.$$

We use $\partial \ell(x, \theta)$ to denote the subgradient set of the function
$\theta \mapsto \ell(x, \theta)$, and for a convex function, $\nabla \ell(x, \theta)$ denotes an
arbitrary element of $\partial \ell(x, \theta)$. We say that a function $f$ is $L$-Lipschitz with
respect to the norm $\|\cdot\|$ over the set $\Theta$ if

$$|f(\theta) - f(\theta')| \le L \|\theta - \theta'\| \quad \text{for all } \theta, \theta' \in \Theta.$$

The notation $\|\cdot\|_p$ denotes a standard $\ell_p$-norm. We use the abbreviation r.c.d.
throughout for regular conditional distribution [3]. The extreme points of a set
$C \subset \mathbb{R}^d$ are denoted by $\mathrm{Ext}(C)$, the convex hull of $C$ is denoted by
$\mathrm{Conv}(C)$, and the support of a distribution $P$ is denoted $\operatorname{supp} P$.
We say values $a_n \asymp b_n$ if $\lim_n (a_n / b_n) = 1$. The symbol $e_i$ denotes the $i$th
standard basis vector in $\mathbb{R}^d$. Lastly, the symbol $\rightrightarrows$ denotes a
set-valued mapping [24].
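For finitely supported distributions, the KL divergence and total-variation distance defined above reduce to finite sums. The sketch below illustrates this under that assumption (counting measure as the base measure); the function names and the two example distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL divergence sum_s p(s) log(p(s) / q(s)) for discrete distributions."""
    total = 0.0
    for ps, qs in zip(p, q):
        if ps > 0.0:
            if qs == 0.0:
                return math.inf   # P puts mass where Q has none
            total += ps * math.log(ps / qs)
    return total

def total_variation(p, q):
    """Total-variation distance: half the L1 distance between mass functions."""
    return 0.5 * sum(abs(ps - qs) for ps, qs in zip(p, q))

p = [0.5, 0.5]
q = [0.9, 0.1]
tv = total_variation(p, q)
kl = kl_divergence(p, q)
# Pinsker's inequality relates the two quantities: tv <= sqrt(kl / 2).
pinsker_holds = tv <= math.sqrt(kl / 2.0)
```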



2 Problem Formulation 

We begin with a formal description of the communication protocol by which information about the 
random variables X is communicated to the procedure M. We then define the notion of optimal 
local privacy studied in this paper, and the minimax framework in which we state our main results. 

2.1 Communication protocol 

In this paper, we focus on statistical learning procedures that have access to data through the
subgradients $\partial \ell(X, \theta)$ of the loss functions. More formally, at each round, the
method $\mathcal{M}$ is given access to a random vector $Z_i$ such that

$$\mathbb{E}[Z_i \mid X_i, \theta] \in \partial \ell(X_i, \theta), \tag{6}$$

where $\theta$ is a parameter chosen by the method. In Appendix A we present an argument that
shows that the unbiasedness of the subgradient inclusion (6) is not only intuitively appealing but
is, in a certain sense, necessary.

In detail, our communication protocol consists of the following steps:

• the method $\mathcal{M}$ sends the parameter vector $\theta$ to the owner of the $i$th sample $X_i$;

• owner $i$ computes a subgradient vector $g \in \partial \ell(X_i, \theta)$;

• the vector $Z_i$ is communicated to $\mathcal{M}$ under the constraint that
$\mathbb{E}[Z_i \mid X_i, \theta] \in \partial \ell(X_i, \theta)$.

$^1$ This is no loss of generality, as $P$ and $Q$ are absolutely continuous with respect to
$\mu = \frac{1}{2}(P + Q)$.
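The three protocol steps can be sketched in one dimension. This is a minimal illustration under stated assumptions, not the estimator analyzed in the paper: the loss is $\ell(x, \theta) = |\theta - x|$, the owner's channel places $Z$ on the two points $\pm M$ with mean equal to the subgradient, and the learner runs projected stochastic gradient descent with a hypothetical step size $r / (M \sqrt{t})$ and returns the averaged iterate. All function names are hypothetical.

```python
import math
import random

def subgradient_abs(x, theta):
    """A subgradient of theta -> |theta - x|; it lies in C = [-1, 1]."""
    return (theta > x) - (theta < x)

def disguise(g, M, rng):
    """Owner's channel: Z in {-M, M} with E[Z | g] = g (requires |g| <= M)."""
    return M if rng.random() < 0.5 + g / (2.0 * M) else -M

def private_sgd(samples, M, radius, rng):
    """Projected SGD on [-radius, radius] that only ever sees disguised Z_i."""
    theta, running_sum = 0.0, 0.0
    for t, x in enumerate(samples, start=1):
        g = subgradient_abs(x, theta)      # step 2: computed by the data owner
        z = disguise(g, M, rng)            # step 3: the only value sent to the learner
        theta -= radius / (M * math.sqrt(t)) * z
        theta = max(-radius, min(radius, theta))
        running_sum += theta
    return running_sum / len(samples)      # averaged iterate

rng = random.Random(1)
samples = [rng.random() for _ in range(20000)]   # median of Uniform[0, 1] is 0.5
theta_hat = private_sgd(samples, M=2.0, radius=1.0, rng=rng)
```

The learner never observes a sample or an exact subgradient, yet the averaged iterate approaches the risk minimizer because each $Z_i$ is unbiased for a subgradient, as required by the inclusion (6).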

We assume throughout that there is a compact set $C \subset \mathbb{R}^d$ such that
$\partial \ell(x, \theta) \subset C$ for all pairs $(\theta, x) \in \Theta \times \mathcal{X}$. Our
goal is to "disguise" the subgradient information with a random variable $Z$ satisfying $Z \in D$,
for some compact set $D$ such that $C \subset \operatorname{int} D \subset \mathbb{R}^d$.

For instance, a common choice of these sets is a pair of norm balls, say of the form

$$C = \{g \in \mathbb{R}^d : \|g\| \le L\} \quad \text{and} \quad
D = \{g \in \mathbb{R}^d : \|g\| \le M\},$$

where $\|\cdot\|$ is a given norm on $\mathbb{R}^d$, and the radius choice $M > L$ ensures that
$C \subset \operatorname{int} D$. This choice covers a variety of online optimization and
stochastic approximation algorithms for which it is assumed that for any $x \in \mathcal{X}$ and
$\theta \in \Theta$, if $g \in \partial \ell(x, \theta)$ then $\|g\| \le L$ for some norm
$\|\cdot\|$. We may obtain privacy by allowing a perturbation of the subgradient $g$, which is then
required to live in a (larger) norm ball of radius $M > L$.
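As one concrete way to perturb a subgradient into a larger ball, the sketch below uses sup-norm balls and a channel supported on the corners of $[-M, M]^d$. This particular channel is an assumption chosen for simplicity (it is not claimed here to be the optimal one), and the function names are hypothetical. Enumerating the $2^d$ corners verifies exactly that the channel is unbiased, $\mathbb{E}[Z \mid g] = g$.

```python
import itertools

def corner_channel(g, M):
    """Distribution over the corners of [-M, M]^d with mean g.

    Coordinate j equals +M with probability 1/2 + g[j] / (2 M), independently
    across coordinates. Returns a list of (corner, probability) pairs.
    """
    dist = []
    for signs in itertools.product((1.0, -1.0), repeat=len(g)):
        prob = 1.0
        for s, gj in zip(signs, g):
            prob *= 0.5 + s * gj / (2.0 * M)
        dist.append((tuple(s * M for s in signs), prob))
    return dist

def mean_of(dist):
    """Expectation of a finitely supported vector-valued distribution."""
    d = len(dist[0][0])
    return [sum(p * z[j] for z, p in dist) for j in range(d)]

L, M = 1.0, 3.0
g = [0.5, -1.0, 0.25]        # a subgradient with sup-norm at most L
dist = corner_channel(g, M)
z_mean = mean_of(dist)       # equals g up to rounding
```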



2.2 Optimal local privacy 

Suppose that $X$ has distribution $P$, and for each $x \in \mathcal{X}$, let $Q(\cdot \mid x)$
denote the regular conditional probability measure of $Z$ given that $X = x$. This pair defines the
marginal distribution $Q(\cdot)$ via $Q(A) = \mathbb{E}_P[Q(A \mid X)]$, where the expectation is
taken with respect to $X \sim P$. The mutual information between $X$ and $Z$ is the expected
Kullback-Leibler (KL) divergence between $Q(\cdot \mid X)$ and $Q(\cdot)$:

$$I(P, Q) = I(X; Z) := \mathbb{E}_P\big[ D_{\mathrm{kl}}\big( Q(\cdot \mid X) \,\|\, Q(\cdot) \big) \big]. \tag{7}$$

We view the problem of privacy as a game between the adversary controlling $P$ and the data
owners, who use $Q$ to obscure the samples $X$. In particular, we say a distribution $Q$ guarantees
a level of privacy $I^*$ if and only if $\sup_P I(P, Q) \le I^*$. Note that this guarantee is
worst-case, ensuring that for any choice of distribution $P$, the publicly available random
variable $Z$ provides at most mutual information $I^*$ about the sample $X$.
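The worst-case guarantee can be evaluated numerically in a toy case. The sketch below is an illustration under stated assumptions: $X$ takes only the two values $\pm L$, $Z$ takes the two values $\pm M$ with $\mathbb{E}[Z \mid X = x] = x$, and the supremum over $P$ is approximated by a grid search. The function name and parameter values are hypothetical.

```python
import math

def mutual_information(p_x, channel):
    """I(X; Z) in nats: E_P[ KL( Q(. | X) || marginal of Z ) ] for finite X, Z."""
    n_z = len(channel[0])
    marginal = [sum(p_x[i] * channel[i][k] for i in range(len(p_x)))
                for k in range(n_z)]
    info = 0.0
    for i, px in enumerate(p_x):
        for k in range(n_z):
            q = channel[i][k]
            if px > 0.0 and q > 0.0:
                info += px * q * math.log(q / marginal[k])
    return info

L, M = 1.0, 2.0
up = 0.5 + L / (2.0 * M)           # Q(Z = +M | X = +L)
channel = [[up, 1.0 - up],         # row for X = +L
           [1.0 - up, up]]         # row for X = -L
grid = [i / 100.0 for i in range(101)]
worst_p, worst_info = max(((p, mutual_information([p, 1.0 - p], channel))
                           for p in grid), key=lambda pair: pair[1])
```

For this symmetric channel the least favorable $P$ is the uniform one, and the resulting value of `worst_info` plays the role of the privacy level $I^*$ that the channel guarantees.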

Our goal is to find a saddle point $P^*, Q^*$ such that

$$\sup_P I(P, Q^*) \le I(P^*, Q^*) \le \inf_Q I(P^*, Q), \tag{8}$$

where the supremum is taken over all distributions $P$ on $\mathcal{X}$ such that
$\nabla \ell(X, \theta) \in C$ with $P$-probability 1, and the infimum is taken over all regular
conditional distributions $Q$ such that if $Z \sim Q(\cdot \mid X)$, then $Z \in D$ and
$\mathbb{E}_Q[Z \mid X, \theta] = \nabla \ell(X, \theta)$. Indeed, if we can find $P^*$ and $Q^*$
satisfying the saddle point (8), then combination with the trivial direction of the max-min
inequality yields

$$\sup_P \inf_Q I(P, Q) = I(P^*, Q^*) = \inf_Q \sup_P I(P, Q).$$
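For completeness, the step from the saddle point (8) to this equality can be written as a single chain of inequalities; the following is a routine verification rather than a new result. Since the chain begins and ends with the same quantity, every inequality in it must hold with equality.

```latex
\begin{align*}
\inf_Q \sup_P I(P, Q)
  &\le \sup_P I(P, Q^*)        && \text{(take } Q = Q^* \text{)} \\
  &\le I(P^*, Q^*)             && \text{(left inequality in (8))} \\
  &\le \inf_Q I(P^*, Q)        && \text{(right inequality in (8))} \\
  &\le \sup_P \inf_Q I(P, Q)   && \text{(take } P = P^* \text{)} \\
  &\le \inf_Q \sup_P I(P, Q)   && \text{(max-min inequality)}.
\end{align*}
```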

To fully formalize this idea and our notions of privacy, we define two collections of probability
measures and associated losses. For sets $C \subset D \subset \mathbb{R}^d$, we define the source
set

$$\mathcal{P}(C) := \{\text{distributions } P \text{ such that } \operatorname{supp} P \subset C\} \tag{9a}$$

and the set of regular conditional distributions (r.c.d.'s), or communicating distributions,

$$\mathcal{Q}(C, D) := \Big\{ \text{r.c.d.'s } Q \text{ s.t. } \operatorname{supp} Q(\cdot \mid c) \subset D
\text{ and } \int_D z \, dQ(z \mid c) = c \text{ for } c \in C \Big\}. \tag{9b}$$

The definitions (9a) and (9b) formally define the sets over which we may take infima and suprema
in the saddle point calculations, and they capture what may be communicated. The conditional
distributions $Q \in \mathcal{Q}(C, D)$ are defined so that for any loss $\ell$ with
$\nabla \ell(x, \theta) \in C$, we have

$$\mathbb{E}_Q[Z \mid X = x, \theta] := \int_D z \, dQ(z \mid \nabla \ell(x, \theta)) = \nabla \ell(x, \theta).$$



We now make the following key definition:

Definition 1. The conditional distribution $Q^*$ satisfies optimal local privacy for the sets
$C \subset D \subset \mathbb{R}^d$ at level $I^*$ if

$$\sup_P I(P, Q^*) = \inf_Q \sup_P I(P, Q) = I^*,$$

where the supremum is taken over distributions $P \in \mathcal{P}(C)$ and the infimum is taken over
regular conditional distributions $Q \in \mathcal{Q}(C, D)$.

We also formulate a corresponding notion of local optimality in the differential privacy setting.
For given sets $C \subset D$, define the differential privacy measure

$$\alpha^*(C, D) := \inf_Q \log \Big( \sup_{S \in \sigma(D)} \; \sup_{x, x' \in C}
\frac{Q(S \mid X = x)}{Q(S \mid X = x')} \Big), \tag{10}$$

where the infimum is taken over all regular conditional distributions $Q \in \mathcal{Q}(C, D)$
such that $\mathbb{E}_Q[Z \mid X = x] = x$. We define optimal local differential privacy as follows:

Definition 2. The conditional distribution $Q^*$ satisfies optimal local differential privacy for
the sets $C \subset D \subset \mathbb{R}^d$ if

$$\sup_P I(P, Q^*) = \inf_Q \sup_P I(P, Q),$$

where the supremum is taken over all distributions $P \in \mathcal{P}(C)$, and the infimum is taken
over all $\alpha^*(C, D)$-differentially private regular conditional distributions
$Q \in \mathcal{Q}(C, D)$.

If a distribution $Q^*$ satisfies optimal local privacy or optimal local differential privacy,
then it guarantees that even for the worst possible distribution on $X$, the information
communicated about $X$ is limited. (Part of our results consists in showing that for suitable sets
$C \subset D$, it is possible to attain $\alpha^*(C, D)$, so it is sensible to, in addition, choose
the distribution that minimizes mutual information.)
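For a channel with finitely many outputs, the quantity inside the infimum in (10) is easy to evaluate, because the supremum over sets $S$ of a ratio of probabilities is attained on a single output point. The sketch below illustrates this for a two-point channel; restricting the inputs to $\pm L$ and the outputs to $\pm M$ is an assumption made for the example, and the function names are hypothetical.

```python
import math

def binary_channel(L, M):
    """Channel on inputs {+L, -L} and outputs {+M, -M} with E[Z | X = x] = x."""
    up = 0.5 + L / (2.0 * M)
    return [[up, 1.0 - up], [1.0 - up, up]]

def privacy_level(channel):
    """Log of the largest ratio Q(z | x) / Q(z | x') over outputs z and inputs x, x'."""
    worst = 0.0
    for k in range(len(channel[0])):
        column = [row[k] for row in channel]
        if max(column) == 0.0:
            continue              # output k never occurs under any input
        if min(column) == 0.0:
            return math.inf       # some input rules out output k entirely
        worst = max(worst, math.log(max(column) / min(column)))
    return worst

alpha_small_ball = privacy_level(binary_channel(1.0, 2.0))   # log((M + L) / (M - L))
alpha_large_ball = privacy_level(binary_channel(1.0, 4.0))
```

Enlarging the output ball from $M = 2$ to $M = 4$ lowers the privacy parameter, which matches the intuition that a larger set $D$ leaves more room to disguise the input.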

In a sense, Definitions 1 and 2 capture the natural competition between privacy and learnability.
The method $\mathcal{M}$ specifies the set $D$ to which the data $Z$ it receives must belong; the
"teachers," or owners of the data $X$, choose the distribution $Q$ to guarantee as much privacy as
possible subject to this constraint. Using these mechanisms, if we can characterize a unique
distribution $Q^*$ attaining the infimum (8) for $P^*$ (and by extension, for any $P$), then we may
study the effects of requiring a bounded amount of information to be communicated to the method
$\mathcal{M}$ about $X$, which we do in Section 3.






2.3 Minimax error 



Given an estimate $\hat\theta$ based on $n$ samples $X$ from a distribution $P$, we assess its
quality in terms of the risk function $R(\theta) = \mathbb{E}_P[\ell(X, \theta)]$. In this section,
we describe the minimax framework for obtaining bounds uniformly over all possible estimators.

More precisely, let $\mathcal{M}$ denote any statistical procedure or method that operates on
stochastic gradient samples, and let $\hat\theta_n$ denote the output of $\mathcal{M}$ after
receiving $n$ such samples. The excess risk of the method $\mathcal{M}$ on the risk $R$ after
receiving $n$ sample gradients is given by

$$\epsilon_n(\mathcal{M}, \ell, \Theta, P) := R(\hat\theta_n) - \inf_{\theta \in \Theta} R(\theta)
= \mathbb{E}_P[\ell(X, \hat\theta_n)] - \inf_{\theta \in \Theta} \mathbb{E}_P[\ell(X, \theta)]. \tag{11}$$

Note that this excess risk is a random variable, since the output $\hat\theta_n$ of the method is a
random variable.
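As a small worked instance of (11), the sketch below evaluates the excess risk of a fixed estimate for the absolute loss under a finitely supported $P$. The three-point distribution, the grid standing in for $\Theta = [-1, 5]$, and the function names are assumptions made for illustration; the infimum over $\Theta$ is approximated by a minimum over the grid.

```python
def risk(theta, support, probs):
    """R(theta) = E_P |theta - X| for a finitely supported P."""
    return sum(p * abs(theta - x) for x, p in zip(support, probs))

def excess_risk(theta_hat, support, probs, candidates):
    """R(theta_hat) minus the smallest risk over a finite candidate set."""
    best = min(risk(theta, support, probs) for theta in candidates)
    return risk(theta_hat, support, probs) - best

support, probs = [0.0, 1.0, 4.0], [0.25, 0.5, 0.25]
candidates = [i / 10.0 for i in range(-10, 51)]   # grid over Theta = [-1, 5]
gap = excess_risk(2.0, support, probs, candidates)
```

Here the risk is minimized at the median $\theta = 1$, where it equals 1, while $R(2) = 1.5$, so the excess risk of the estimate $\hat\theta = 2$ is 0.5.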

In our settings, in addition to the randomness in the sampling distribution $P$, there is
additional randomness from the perturbation applied to stochastic gradients of the objective
$\ell(X, \cdot)$ to mask $X$ from the statistician or method $\mathcal{M}$. Let $Q$ denote the
regular conditional probability (the channel distribution) whose conditional part is defined on the
range of the (set-valued) subgradient mapping
$\partial \ell(X, \cdot) : \Theta \rightrightarrows \mathbb{R}^d$. Since the output $\hat\theta_n$
of the statistical procedure is a random function of both $P$ and $Q$, we take the expectation and
measure the expected sub-optimality of the risk according to $P$ and $Q$. We let $\mathcal{L}$
denote a collection of loss functions, where for a distribution $P$ on $\mathcal{X}$, the set
$\mathcal{L}(P)$ denotes the losses $\ell : \operatorname{supp} P \times \Theta \to \mathbb{R}$
belonging to $\mathcal{L}$. The minimax error is then given by

$$\epsilon_n^*(\mathcal{L}, \Theta) := \sup_P \, \inf_{\mathcal{M}} \, \sup_{\ell \in \mathcal{L}(P)}
\mathbb{E}_{P,Q}[\epsilon_n(\mathcal{M}, \ell, \Theta, P)], \tag{12}$$

where the expectation is taken over the random samples $X \sim P$ and
$Z \sim Q(\cdot \mid X, \theta)$. In this paper, we provide characterizations of the minimax error
(12) for several classes of loss functions $\mathcal{L}(P)$, giving sharp results when the privacy
distribution $Q$ satisfies optimal local privacy for any loss function $\ell \in \mathcal{L}(P)$
and distribution $P$.



3 Optimal Learning Rates and Tradeoffs 

With our framework in place, we now turn to statements of our main results. We begin by imposing
certain (weak) conditions on the families of loss functions that we consider, and subsequently turn
to the main results of this section (Theorems 1, 2, and 3) as well as some of their consequences
(Corollaries 1, 2, and 3).



3.1 Families of loss functions 

We assume that our collection of loss functions obeys certain natural smoothness conditions. For
each $p \in [1, \infty]$, we use $\|\cdot\|_p$ to denote the usual $\ell_p$-norm, and we use
$q = p/(p - 1)$ to denote the conjugate exponent satisfying the relation $1/p + 1/q = 1$. With this
notation, we have the following definition:

Definition 3. For parameters $L > 0$ and $p \ge 1$, an $(L, p)$-loss function is a measurable
function $\ell : \mathcal{X} \times \Theta \to \mathbb{R}$ such that for $P$-almost every
$x \in \mathcal{X}$, the function $\theta \mapsto \ell(x, \theta)$ is convex and $L$-Lipschitz
continuous with respect to the norm $\|\cdot\|_q$.






A convex loss $\ell$ satisfies Definition 3 if and only if for all $\theta \in \Theta$, we have the
inequality $\|g\|_p \le L$ for any subgradient $g \in \partial \ell(x, \theta)$.

In order to illustrate this definition, let us consider a few examples:

Example 1. As a simple example, we may consider finding a multi-dimensional median, in which
case the data $x \in \mathbb{R}^d$ and

$$\ell(x, \theta) = L \|\theta - x\|_1.$$

This loss is $L$-Lipschitz with respect to the $\ell_1$-norm, its subgradients belonging to
$[-L, L]^d$, and hence $\ell$ belongs to the class of $(L, \infty)$-loss functions.
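The claim in Example 1 can be checked numerically. The sketch below is an illustration with hypothetical function names and example values: it computes one subgradient of $\theta \mapsto L \|\theta - x\|_1$ and tests the defining subgradient inequality at a given point.

```python
def l1_loss(x, theta, L):
    """The loss L * ||theta - x||_1 from Example 1."""
    return L * sum(abs(t - xi) for t, xi in zip(theta, x))

def l1_subgradient(x, theta, L):
    """One subgradient: L * sign(theta_j - x_j) per coordinate (0 at ties)."""
    return [L * ((t > xi) - (t < xi)) for t, xi in zip(theta, x)]

L = 2.0
x = [0.0, 1.0, -3.0]
theta = [0.5, 1.0, -4.0]
g = l1_subgradient(x, theta, L)   # every entry lies in [-L, L]

def satisfies_subgradient_inequality(theta_prime):
    """Check f(theta') >= f(theta) + <g, theta' - theta> at one point."""
    linear = sum(gj * (tp - t) for gj, tp, t in zip(g, theta_prime, theta))
    return l1_loss(x, theta_prime, L) >= l1_loss(x, theta, L) + linear - 1e-12
```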

Example 2 (Classification). We may also consider classification based on either the hinge loss or
logistic regression loss. In this setting, the data comes in pairs $x = (a, b)$, where
$a \in \mathbb{R}^d$ is the vector of regressors or predictors and $b \in \{-1, 1\}$ is the label;
the losses are

$$\ell(x, \theta) = [1 - b \langle a, \theta \rangle]_+ \quad \text{and} \quad
\ell(x, \theta) = \log\big(1 + \exp(-b \langle a, \theta \rangle)\big).$$

By computing (sub)gradients, we may verify that each of these belongs to the class of
$(L, p)$-losses if and only if the covariate vector $a \in \mathbb{R}^d$ satisfies
$\|a\|_p \le L$, which is a common assumption.
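The (sub)gradient computation mentioned in Example 2 can be sketched as follows. The function names and the example data are hypothetical. In both cases the (sub)gradient is a multiple of the covariate vector $a$ with a weight of magnitude at most one, so its $\ell_p$-norm never exceeds $\|a\|_p$.

```python
import math

def hinge_subgradient(a, b, theta):
    """A subgradient of theta -> max(0, 1 - b <a, theta>)."""
    margin = b * sum(aj * tj for aj, tj in zip(a, theta))
    return [-b * aj for aj in a] if margin < 1.0 else [0.0] * len(a)

def logistic_gradient(a, b, theta):
    """Gradient of theta -> log(1 + exp(-b <a, theta>))."""
    margin = b * sum(aj * tj for aj, tj in zip(a, theta))
    weight = 1.0 / (1.0 + math.exp(margin))   # lies strictly between 0 and 1
    return [-b * weight * aj for aj in a]

def p_norm(v, p):
    """The l_p norm of a vector, with p = math.inf giving the sup-norm."""
    if math.isinf(p):
        return max(abs(vj) for vj in v)
    return sum(abs(vj) ** p for vj in v) ** (1.0 / p)

a, b = [0.5, -1.5, 1.0], 1
theta = [0.2, 0.1, -0.3]
g_hinge = hinge_subgradient(a, b, theta)
g_logistic = logistic_gradient(a, b, theta)
```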

Definition 3 is natural given the communication strategy we outline in Section 2.1. Since our loss
functions satisfy $\|\partial \ell(X, \theta)\|_p \le L$, the channel distribution $Q$ amounts to
perturbing subgradients to larger norm balls while maintaining the appropriate expectations.



3.2 Bounds on minimax errors 

We now state our three main theorems, deferring proofs to Section 5. Our first theorem applies to
the class of $(L, \infty)$ loss functions as given in Definition 3. For this theorem, we assume
that the set to which the perturbed data $Z$ must belong is $[-M_\infty, M_\infty]^d$, where
$M_\infty \ge L$. In the notation of Definitions 1 and 2, this corresponds to taking
$C = [-L, L]^d$ and $D = [-M_\infty, M_\infty]^d$. We state two variants of the first theorem, as
one gives slightly sharper results for an important special case.

Theorem 1. Let $\mathcal{L}$ be the collection of $(L, \infty)$ loss functions, assume the
conditions of the preceding paragraph, and let $Q$ be optimally locally private (Definition 1) for
$\mathcal{L}$. Then

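(a) if $\Theta$ contains the $\ell_\infty$ ball of radius $r$,

$$\epsilon_n^*(\mathcal{L}, \Theta) \ge \frac{M_\infty r d}{163 \sqrt{n}};$$

(b) if $\Theta = \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le r\}$ and $d \ge 2$,

$$\epsilon_n^*(\mathcal{L}, \Theta) \ge \frac{M_\infty r \sqrt{\log(2d)}}{17 \sqrt{n}}.$$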



Our second main theorem applies to loss functions and objectives with a different geometry
than the first and last. Now we assume that the loss functions $\mathcal{L}$ consist of $(L, 1)$
losses, and that the perturbed data must belong to the $\ell_1$ ball of radius $M_1$, i.e.,
$Z \in \{z \in \mathbb{R}^d : \|z\|_1 \le M_1\}$. Thus in the notation of Definition 1, we have
$D = (M_1/L) C$, where $C = \{g \in \mathbb{R}^d : \|g\|_1 \le L\}$. If we define $M = M_1/L$, we
may define the constants

$$\gamma := \log\left( \frac{2d - 2 + \sqrt{(2d - 2)^2 + 4(M^2 - 1)}}{2(M - 1)} \right)
\quad \text{and} \quad
\Delta(\gamma) := \frac{e^{\gamma} - e^{-\gamma}}{e^{\gamma} + e^{-\gamma} + 2(d - 1)}, \tag{13}$$

which are related to the unique distribution achieving optimal local privacy for the $(L, 1)$
losses and the larger $\ell_1$ ball above (see equation (15) and Proposition 2). We have the
following theorem.

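Theorem 2. Let $\mathcal{L}$ be the collection of $(L, 1)$ loss functions, assume the conditions of
the preceding paragraph, and let $Q$ be optimally private for the collection $\mathcal{L}$. If
$\Theta$ contains the $\ell_\infty$-ball of radius $r$,

$$\epsilon_n^*(\mathcal{L}, \Theta) \ge \frac{1}{163} \cdot \frac{r L \sqrt{d}}{\sqrt{n} \, \Delta(\gamma)}.$$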
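The constants in (13) are straightforward to evaluate. The sketch below (with hypothetical function names) computes $\gamma$ and $\Delta(\gamma)$ from $d$ and $M = M_1/L$. With the constants as written in (13), $y = e^{\gamma}$ is the positive root of $(M - 1) y^2 - 2(d - 1) y - (M + 1) = 0$, and rearranging that equation gives $\Delta(\gamma) = 1/M$; the code can be used to confirm this numerically.

```python
import math

def gamma(d, M):
    """The constant gamma from (13); M = M_1 / L must exceed 1."""
    root = math.sqrt((2 * d - 2) ** 2 + 4 * (M * M - 1))
    return math.log((2 * d - 2 + root) / (2 * (M - 1)))

def delta(g, d):
    """The constant Delta(gamma) from (13)."""
    return (math.exp(g) - math.exp(-g)) / (math.exp(g) + math.exp(-g) + 2 * (d - 1))

d = 5
delta_at_two = delta(gamma(d, 2.0), d)            # M = 2
delta_near_one = delta(gamma(d, 1.0 + 1e-6), d)   # M just above 1
```

As $M$ decreases to 1 (no enlargement of the ball), $\gamma$ grows without bound and $\Delta(\gamma)$ approaches 1, so the factor $1/\Delta(\gamma)$ in the theorem measures the price of the added privacy.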

For our final main theorem, we focus on differentially private algorithms, where we assume that
communication respects optimal local differential privacy, as given by Definition 2. We use the
same collection of loss functions $\mathcal{L}$ as in Theorem 1, that is, $(L, \infty)$-loss
functions. We also assume that the set to which the perturbed data $Z$ belong is
$[-M_\infty, M_\infty]^d$, though the specific value of $M_\infty$ is not actually important for
the statement of the theorem.

Theorem 3. Let $\mathcal{L}$ be the collection of $(L, \infty)$ loss functions, and assume that $Z$
is optimally locally differentially private (Definition 2), attaining $\alpha$-differential privacy
for the set $\mathcal{L}$. Let $d \ge 2$ and assume $\alpha \le 5/4$. Then

a 6l\Jn 

Remarks: We make a few remarks on Theorems 1, 2, and 3. First, we note that, when reduced
to the special case of having no random distribution $Q$, Theorems 1 and 2 each yield a minimax
rate for stochastic optimization problems. Indeed, in Theorem 1 we may take $M_\infty = L$, in which
case (focusing on the second statement of the theorem) we obtain that for $\Theta = \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le r\}$,

$$\epsilon_n^*(\mathcal{L}, \Theta) \ge \frac{1}{17} \cdot \frac{L r \sqrt{\log(2d)}}{\sqrt{n}}.$$

Mirror descent algorithms [33, ISJ] can be used to minimize this class of loss functions, and their 
convergence rate matches this lower bound up to constant factors (also see our results in the sequel,
as well as the explanation of Agarwal et al. [1]). Thus, our result when specialized to this setting
is unimprovable. Moreover, our analysis is sharper than previous analyses, as none of the existing
lower bounds recover the logarithmic dependence on the dimension $d$, which is evidently necessary.

In Theorem 2, if we take the constant $M_1 \downarrow L$, we see that $\gamma \to \infty$ and consequently $\Delta(\gamma) \to 1$.
Thus we obtain that whenever $\Theta$ contains the $\ell_\infty$ ball of radius $r$,

$$\epsilon_n^*(\mathcal{L}, \Theta) \ge \frac{1}{163} \cdot \frac{r L \sqrt{d}}{\sqrt{n}}.$$

For this class of loss functions, the method of stochastic gradient descent attains a matching upper
bound, again up to constant factors. (See Appendix C in Agarwal et al. [1], and also our results in
the next section.)
Our second remark is that while our results appear to require disguising only gradient information,
based on our communication setting in Section 2.1, this restriction is not actually substantial.
Indeed, when the domain $\Theta$ is a norm ball we can establish each of our lower bounds using the loss
function $\ell(x, \theta) = \langle x, \theta \rangle$. In this case, $\nabla \ell(x, \theta) = x$, so that the communication scheme explicitly
disguises exactly the individual data $X_i$.

Finally, we have presented results for specific geometric properties of the loss functions. These
geometric properties are natural, as exemplified by our examples in Section 3.1. It is, however, also
possible to use our techniques to derive alternative results; such an extension requires computing
the optimal distribution attaining local privacy according to Definitions 1 or 2, then applying the
lower-bounding techniques we develop in the sequel.



3.3 Tradeoffs between privacy and statistical error 

We now turn to some consequences of Theorems 1, 2, and 3 for the tradeoffs between rates of
convergence for any statistical procedure and the desired privacy of a user. We present three corollaries
that characterize this tradeoff. Looking ahead to Section 4, we may use Propositions 1, 2, and 3
to establish the results. For the mutual-information-based results (Theorems 1 and 2), we
apply the first two propositions to derive a bijection between the sizes $M_\infty$ and $M_1$ of the perturbation
set and the amount of privacy, as measured by the worst case mutual information $I^*$. We
can then combine the lower bounds of Theorems 1 and 2 with results on stochastic approximation
to obtain the tradeoffs. For differentially private algorithms, whose lower bound is provided by
Theorem 3, we can show an upper bound on the necessary magnitude of the gradient bound $M_\infty$
to allow $\alpha$-differential privacy, again applying known stochastic approximation results. We provide
the full proofs in Sections 5.7, 5.8, and 5.9, respectively.

In each of our corollaries, the upper bound is attained by (a variant of) mirror descent (3^,0, [3], 
which is a non-Euclidean generalization of the stochastic gradient method jsS, 3^, 4^. Recall that 



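stochastic gradient methods are iterative methods that update a parameter $\theta^t$ over iterations $t$
of an algorithm using stochastic gradient information. In particular, at iteration $t$, the algorithm
receives a vector $g_t \in \mathbb{R}^d$ with conditional expectation $\mathbb{E}[g_t \mid \theta^t] \in \partial R(\theta^t)$, then performs the update

$$\theta^{t+1} = \mathop{\mathrm{argmin}}_{\theta \in \Theta} \left\{ \eta \langle g_t, \theta \rangle + \Psi(\theta, \theta^t) \right\}.$$

Here $\eta$ is a step-size and $\Psi$ is a Bregman divergence, which keeps $\theta^{t+1}$ relatively close to $\theta^t$. (See,
e.g., [34] for further details.) With appropriate choice of $\Psi$, the mirror descent algorithm
enjoys the following convergence guarantees. Define $\hat{\theta}_n = \frac{1}{n} \sum_{t=1}^n \theta^t$. If $\mathbb{E}[\|g_t\|_\infty^2 \mid \theta^t] \le M_\infty^2$ for all
$t$ and $\Theta$ is contained in the $\ell_1$-ball of radius $r_1$, then with appropriate choice of $\Psi$ and $\eta$,

$$\mathbb{E}[R(\hat{\theta}_n)] - R(\theta^*) = O\left( \frac{M_\infty r_1 \sqrt{\log d}}{\sqrt{n}} \right). \tag{14a}$$

See, for example, Beck and Teboulle [1, Section 5] or Nemirovski et al. [34, Section 2.3]. Similarly,
with the choice $\Psi(\theta, \theta') = \|\theta - \theta'\|_2^2$, if $\mathbb{E}[\|g_t\|_2^2 \mid \theta^t] \le M_2^2$ and $\Theta$ is contained in the $\ell_2$-ball of radius
$r_2$, then

$$\mathbb{E}[R(\hat{\theta}_n)] - R(\theta^*) = O\left( \frac{M_2 r_2}{\sqrt{n}} \right). \tag{14b}$$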



For instance, see the references [49] and [34, Section 2.2] for results of this type.
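To make the stochastic gradient update concrete, the following is a minimal sketch of the Euclidean special case only. It assumes the divergence is scaled as $\frac{1}{2}\|\theta - \theta'\|_2^2$, so that the update reduces to a projected gradient step onto an $\ell_2$-ball, and it returns the averaged iterate $\hat{\theta}_n$. The names `project_l2_ball`, `averaged_sgd`, and `sample_gradient` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math
import random


def project_l2_ball(theta, radius):
    """Euclidean projection of theta onto {theta : ||theta||_2 <= radius}."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm <= radius:
        return list(theta)
    return [t * radius / norm for t in theta]


def averaged_sgd(sample_gradient, dim, radius, n_steps, step_size, rng):
    """Projected stochastic gradient method started at the origin.

    sample_gradient(theta, rng) is assumed to return an unbiased
    (sub)gradient estimate at theta. Returns the average of the
    iterates theta^1, ..., theta^n.
    """
    theta = [0.0] * dim
    running_sum = [0.0] * dim
    for _ in range(n_steps):
        # Accumulate the current iterate before updating it.
        running_sum = [s + t for s, t in zip(running_sum, theta)]
        g = sample_gradient(theta, rng)
        stepped = [t - step_size * gi for t, gi in zip(theta, g)]
        theta = project_l2_ball(stepped, radius)
    return [s / n_steps for s in running_sum]
```

A caller would supply its own gradient oracle, for example one returning the privately perturbed vector $Z$ in place of the raw subgradient; a step size proportional to $1/\sqrt{n}$ is the usual choice behind rates of the form (14b).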
Using the mirror descent algorithm, we can establish the following:



Corollary 1. Under the conditions of Theorem 1(b), assume moreover that $M_\infty \ge 2L$ and $Q^*$
satisfies optimal local privacy at information level $I^*$. Then for universal constants $0 < c_\ell \le c_u$,
the minimax error is sandwiched as

$$c_\ell \frac{\sqrt{d}\, r L \sqrt{\log d}}{\sqrt{n I^*}} \le \epsilon_n^*(\mathcal{L}, \Theta) \le c_u \frac{\sqrt{d}\, r L \sqrt{\log d}}{\sqrt{n I^*}}.$$


It is worth noting that similar upper and lower bounds can be obtained under the conditions
of part (a) of Theorem 1, again by using mirror descent, but we lose a factor of $\sqrt{\log d}$ in the lower
bound. (There is an additional factor of $d$ in the statement (a), and $\Theta \supseteq \{\theta \in \mathbb{R}^d : \|\theta\|_\infty \le r/d\}$.)
In this case we would not need to assume that $\Theta$ is an $\ell_1$-ball for the lower bound.

We now turn to an analogous result, but based on an application of Theorem 2 and Proposition 2.

Corollary 2. Under the conditions of Theorem 2, assume that $M_1 \ge 2L$ and $Q^*$ satisfies optimal
local privacy at information level $I^*$. Moreover, suppose that $\Theta$ contains an $\ell_\infty$-ball of radius $c_1 r$
and is contained in an $\ell_\infty$-ball of radius $c_2 r$, where $0 < c_1 \le c_2$ are constants. Then for universal
constants $0 < c_\ell \le c_u$, the minimax error is sandwiched as

$$c_\ell \frac{\sqrt{d}\, r L \sqrt{d}}{\sqrt{n I^*}} \le \epsilon_n^*(\mathcal{L}, \Theta) \le c_u \frac{\sqrt{d}\, r L \sqrt{d}}{\sqrt{n I^*}}.$$

Finally, we provide a corollary to Theorem 3, which gives us sharp tradeoffs in the differentially
private case.

Corollary 3. Under the conditions of Theorem 3, assume that $Q^*$ satisfies Definition 2, attaining
$\alpha$-differential privacy. Then for universal constants $0 < c_\ell \le c_u$, the minimax error is sandwiched
as

Q < e„(-C,0) < Cn- 



a \ n a \ n 



As noted above, mirror descent (or stochastic gradient descent) achieves the upper bound in
each of the corollaries. We also note the difference in the dimension dependence in the convergence
rates given by Corollaries 1 and 3 and that given by Corollary 2. In particular, the former two
have a dependence on dimension growing as $\sqrt{d}$, while the latter depends on $d$. This is somewhat
intuitive: under the conditions of Theorems 1(b) and 3, we are in a high-dimensional regime
with a small set (see, e.g. 0, Section 5], [sj Chapter 5], or 34, Section 2.3]). So we expect 



weaker dimension dependence. In Corollary 2, any optimization method must essentially identify $d$
different coordinates of a vector in $[-r, r]^d$, an $\ell_\infty$ ball, which slows convergence and yields a scaling
of $\sqrt{d}\, r L / \sqrt{n}$ even in the standard (non-private) minimax case [1]. Thus for both Corollaries 1
and 2 we see that incorporating privacy induces a penalty of roughly $\sqrt{d}/\sqrt{I^*}$ in convergence rate.
The scaling differences for mutual information (as $1/\sqrt{I^*}$) and differential privacy (as $1/\alpha$) are, as
yet, incomparable, as there does not appear to be a simple mapping between information-theoretic
notions of privacy and differential privacy.
4 Saddle Points, Optimal Privacy, and Mutual Information

In this section, we explore conditions for a distribution $Q^*$ to satisfy optimal local privacy, as
given by Definition 1. We give a few characterizations of necessary (and sometimes sufficient)
conditions based on the compact sets $C \subset D$ for distributions $P^*$ and $Q^*$ to achieve the saddle
point (8). Our results can be viewed as rate distortion theorems (with source $P$ and
channel $Q$) for certain compact alphabets, though as far as we know, they are all new. Thus,
we refer to the conditional distribution $Q$, which is designed to maintain the privacy of the data
$X$ by communication of $Z$, interchangeably as the privacy-preserving distribution or the channel
distribution.

Note that since we wish to bound $I(X; Z)$ for general losses $\ell$, as captured in the definitions of the
source $\mathcal{P}(C)$ and communication set $\mathcal{Q}(C, D)$ in Eqs. (9a) and (9b), we must address the case when
$\ell(X, \theta) = \langle \theta, X \rangle$, in which case $\nabla \ell(X, \theta) = X$; this shows (by the data-processing inequality
[21, Chapter 5]) that it is no loss of generality to assume that $X \in C$ with probability 1 and that we
must have $\mathbb{E}[Z \mid X] = X$. Thus we present each of our results assuming that $\ell(X, \theta) = \langle \theta, X \rangle$, since
a distribution $Q^*$ is optimally locally private or optimally differentially locally private if and only
if it attains the saddle point with this choice of loss.



4.1 General saddle point characterizations 

We begin with a general characterization, first defining the types of sets $C$ and $D$ that we use in
our characterization of privacy. Such sets are reasonable for many applications (recall Section 3.1).
We focus on the case when the compact sets $C$ and $D$ are (suitably symmetric) norm balls:

Definition 4. Let $C \subset \mathbb{R}^d$ be a compact convex set with extreme points $u_i \in \mathbb{R}^d$, $i \in I$ for some
index set $I$. Then $C$ is rotationally invariant through its extreme points if $\|u_i\|_2 = \|u_j\|_2$ for each
$i, j$, and for any unitary matrix $U$ such that $U u_i = u_j$ for some $i \ne j$, we have $UC = C$.

Some examples of convex sets rotationally invariant through their extreme points include $\ell_p$-norm
balls for $p = 1, 2, \infty$, though $\ell_p$-balls for $p \notin \{1, 2, \infty\}$ are not.

The following theorem gives a general characterization of the minimax mutual information for 
such rotationally invariant sets by providing saddle point distributions P* and Q* . We provide the 
proof of Theorem [4] in Section ID. II 

Theorem 4. Let $C$ be a compact convex polytope rotationally invariant through its $m < \infty$ extreme
points $\{u_i\}_{i=1}^m$ and $D = (1 + \alpha)C$ for some $\alpha > 0$. Let $Q^*$ be the conditional distribution of $Z \mid X$
that maximizes the entropy $H(Z \mid X = x)$ subject to the constraints that

$$\mathbb{E}_Q[Z \mid X = x] = x$$

for $x \in C$ and that $Z$ is supported on $(1 + \alpha)u_i$ for $i = 1, \ldots, m$. Then $Q^*$ satisfies Definition 1,
optimal local privacy, and $Q^*$ is (up to measure zero sets) unique. Moreover, the distribution $P^*$
uniform on $\{u_i\}_{i=1}^m$ attains the saddle point (8).



Remarks: We make a few brief remarks here, deferring a somewhat deeper discussion of the 
implications of Theorem [5] to Section ID. II as an understanding of the proof helps. While in the 
theorem we assume that $Q^*(\cdot \mid X = x)$ maximizes the entropy for each $x \in C$, this is not in fact
essential. In fact, we may introduce a random variable $X'$ between $X$ and $Z$: let $X'$ be distributed among the
extreme points $\{u_i\}_{i=1}^m$ of $C$ in any way such that $\mathbb{E}[X' \mid X] = X$, then use the maximum entropy
distribution $Q^*(\cdot \mid u_i)$ defined in the theorem when $X \in \{u_i\}_{i=1}^m$ to sample $Z$ from $X'$. By using
the convexity of the negative entropy as in the bound (44) in the proof of Theorem 4 (really, the
information processing inequality [21, Chapter 5]), this Markov chain $X \to X' \to Z$ guarantees at
least minimax information $I(X; Z) \le \inf_Q \sup_P I(P, Q)$.



4.2 Specific saddle point computations 

With Theorem 4 in place, we can explicitly characterize the minimax mutual information for $\ell_1$ and
$\ell_\infty$ balls by computing maximum entropy distributions. That is, we show the unique distributions
that attain optimal local privacy — the distributions that guarantee as much (of our definition of) 
privacy as possible subject to certain constraints. We present two propositions in this regard, 
providing some discussion and giving proofs in Sections ID. 21 and ID.3[ 

First, consider the case where $X \in [-1, 1]^d$ and $Z \in [-M, M]^d$. For notational convenience, we
define the binary entropy $h(p) = -p \log p - (1 - p) \log(1 - p)$. We have

Proposition 1. For a constant $M > 1$, let $X \in [-1, 1]^d$ and $Z \in [-M, M]^d$ be random variables
such that $\mathbb{E}[Z \mid X] = X$ almost surely. Define $Q^*$ to be the conditional distribution on $Z \mid X$ such
that the coordinates of $Z$ are independent, have range $\{-M, M\}$, and

$$Q^*(Z_i = M \mid X) = \frac{1}{2} + \frac{X_i}{2M} \quad\text{and}\quad Q^*(Z_i = -M \mid X) = \frac{1}{2} - \frac{X_i}{2M}.$$

Then $Q^*$ satisfies Definition 1, optimal local privacy, and moreover,

$$\sup_P I(P, Q^*) = d - d \cdot h\left(\frac{1}{2} + \frac{1}{2M}\right).$$

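Before continuing, we give a slightly more intuitive understanding of Proposition 1. Concavity
implies that for $a, b > 0$, $\log(a) \le \log b + b^{-1}(a - b)$, or $-\log(a) \ge -\log(b) + b^{-1}(b - a)$, so

$$-\log\left(\frac{1}{2} + \frac{1}{2M}\right) \ge -\log\frac{1}{2} - 2 \cdot \frac{1}{2M} \quad\text{and}\quad -\log\left(\frac{1}{2} - \frac{1}{2M}\right) \ge -\log\frac{1}{2} + 2 \cdot \frac{1}{2M}.$$

In particular, we see that

$$h\left(\frac{1}{2} + \frac{1}{2M}\right) \ge \left(\frac{1}{2} + \frac{1}{2M}\right)\left(\log 2 - \frac{1}{M}\right) + \left(\frac{1}{2} - \frac{1}{2M}\right)\left(\log 2 + \frac{1}{M}\right) = \log 2 - \frac{1}{M^2}.$$

That is, we have for any distribution $P$ on $X$, where $X \in [-1, 1]^d$, that (in natural logarithms)

$$I(P, Q^*) \le d \log 2 - d \cdot h\left(\frac{1}{2} + \frac{1}{2M}\right) \le \frac{d}{M^2},$$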

and this bound is tight to 0{M~^). 
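As an illustration of Proposition 1, the sketch below samples from the coordinate-wise channel and evaluates the worst-case mutual information. It is a minimal example under stated assumptions: logarithms are natural, so the leading term $d$ of the proposition appears as $d \log 2$, and the function names are hypothetical rather than taken from the paper.

```python
import math
import random


def sample_linf_channel(x, M, rng):
    """Draw Z in {-M, M}^d with independent coordinates and E[Z | X = x] = x."""
    if M <= 1:
        raise ValueError("M must exceed 1")
    z = []
    for xi in x:
        if abs(xi) > 1:
            raise ValueError("each coordinate of x must lie in [-1, 1]")
        # P(Z_i = M | x) = 1/2 + x_i / (2M), so that E[Z_i | x] = x_i.
        p_plus = 0.5 + xi / (2.0 * M)
        z.append(M if rng.random() < p_plus else -M)
    return z


def binary_entropy_nats(p):
    """Binary entropy h(p) in nats, with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def worst_case_information_nats(d, M):
    """d * (log 2 - h(1/2 + 1/(2M))): the Proposition 1 value written in nats."""
    return d * (math.log(2.0) - binary_entropy_nats(0.5 + 1.0 / (2.0 * M)))
```

For example, `worst_case_information_nats(10, 4.0)` can be compared against the bound $d/M^2 = 0.625$ derived above.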

We now consider the case when $X \in \{x \in \mathbb{R}^d : \|x\|_1 \le 1\}$ and $Z \in \{z \in \mathbb{R}^d : \|z\|_1 \le M\}$. Here
the arguments are slightly more complicated, as the coordinates of the random variables are no
longer independent, but Theorem 4 still allows us to explicitly characterize the saddle point of the
mutual information. Before stating the proposition, we recall that if $e_i \in \mathbb{R}^d$ are the standard basis
vectors, then the extreme points of the $\ell_1$-ball of radius 1 are the $2d$ vectors $\{\pm e_i\}_{i=1}^d$.
Proposition 2. For a constant $M > 1$, let $X \in \{x \in \mathbb{R}^d : \|x\|_1 \le 1\}$ and $Z \in \{z \in \mathbb{R}^d : \|z\|_1 \le M\}$
be random variables. Define the parameter

$$\gamma := \log\left(\frac{2d - 2 + \sqrt{(2d-2)^2 + 4(M^2 - 1)}}{2(M - 1)}\right), \tag{15}$$

and let $Q^*$ be the conditional distribution on $Z \mid X$ such that $Z$ is supported on $\{\pm M e_i\}_{i=1}^d$, and

$$Q^*(Z = M e_i \mid X = e_i) = \frac{e^{\gamma}}{e^{\gamma} + e^{-\gamma} + (2d - 2)}, \tag{16a}$$

$$Q^*(Z = -M e_i \mid X = e_i) = \frac{e^{-\gamma}}{e^{\gamma} + e^{-\gamma} + (2d - 2)}, \tag{16b}$$

$$Q^*(Z = \pm M e_j \mid X = e_i,\ j \ne i) = \frac{1}{e^{\gamma} + e^{-\gamma} + (2d - 2)}. \tag{16c}$$

(For $X \notin \{\pm e_i\}$, define $X'$ to be randomly selected in any way from among $\{\pm e_i\}$ such that
$\mathbb{E}[X' \mid X] = X$, then sample $Z$ from $X'$ according to (16a)-(16c).) Then $Q^*$ satisfies Definition 1,
optimal local privacy, and

$$\sup_P I(P, Q^*) = \log(2d) - \log\left(e^{\gamma} + e^{-\gamma} + 2d - 2\right) + \gamma \frac{e^{\gamma}}{e^{\gamma} + e^{-\gamma} + 2d - 2} - \gamma \frac{e^{-\gamma}}{e^{\gamma} + e^{-\gamma} + 2d - 2}.$$
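The quantities in Proposition 2 are straightforward to evaluate numerically. The sketch below, with hypothetical function names, computes $\gamma$ from (15), the three masses in (16a)-(16c), and the stated mutual information (natural logarithms assumed). One can check with it that the masses sum to one and that $M(e^{\gamma} - e^{-\gamma})/(e^{\gamma} + e^{-\gamma} + 2d - 2) = 1$, which is the unbiasedness condition $\mathbb{E}[Z \mid X = e_i] = e_i$.

```python
import math


def l1_channel_gamma(d, M):
    """The parameter gamma of equation (15); requires M > 1."""
    if M <= 1:
        raise ValueError("M must exceed 1")
    numerator = (2 * d - 2) + math.sqrt((2 * d - 2) ** 2 + 4 * (M * M - 1))
    return math.log(numerator / (2 * (M - 1)))


def l1_channel_pmf(d, M):
    """Masses of Z given X = e_i: (Z = M e_i, Z = -M e_i, each other support point)."""
    g = l1_channel_gamma(d, M)
    total = math.exp(g) + math.exp(-g) + (2 * d - 2)
    return math.exp(g) / total, math.exp(-g) / total, 1.0 / total


def l1_worst_case_information(d, M):
    """log(2d) - log(total) + gamma * (e^gamma - e^-gamma) / total, in nats."""
    g = l1_channel_gamma(d, M)
    total = math.exp(g) + math.exp(-g) + (2 * d - 2)
    return math.log(2 * d) - math.log(total) + g * (math.exp(g) - math.exp(-g)) / total
```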

Proposition 2 is somewhat more complex than the $\ell_\infty$ case. We remark that the additional
sampling to guarantee that $X' \in \{\pm e_i\}$ (where the conditional distribution $Q^*$ is defined) can
be accomplished simply: define the random variable X' so that X' = ejsign(xj) with probability 
\xi\/ Evidently E[X' | X] = x, and X ^ X' ^ Z for Z distributed according to Q* defines a 

Markov chain as in our remarks following Theorem 4. An asymptotic expansion allows us to gain
a somewhat clearer picture of the values of the mutual information, though we do not derive upper
bounds as we did for Proposition 1. We have the following corollary, proved in Appendix E.

Corollary 4. Let $Q^*$ denote the conditional distribution in Proposition 2. Then

TlTyn*^ d ^^f . ( d^ log\d) 



4.3 Saddle points for differentially private communication 

Our final result in this section characterizes saddle points for distributions satisfying Definition 2.
Such calculations are, in general, non-trivial, so we restrict our attention to results necessary for
the setting of Theorem 3. To that end, we focus on the case where $C$ and $D$ are $\ell_\infty$ balls, which is
relevant for high-dimensional statistical and optimization settings. Without loss of generality (by
scaling), we may take $C = [-1, 1]^d$ and $D = [-M, M]^d$. We have

Proposition 3. For a constant $M > 1$, let $X \in [-1, 1]^d$ and $Z \in [-M, M]^d$ be random variables
such that $\mathbb{E}[Z \mid X] = X$ almost surely. Fix any $x \in \{-1, 1\}^d$ and for $k \in \{0, 2, 4, \ldots, 2\lceil d/2 \rceil - 2\}$
define the constants $q_k^+ \ge 0$ and $q_k^- \ge 0$ to satisfy the linear equations

ze{~l,l}'^:(z,x)>k ze{~l,iy'-:(z,x)<k 
Set $k^* = \mathop{\mathrm{argmin}}_k \{q_k^+ / q_k^-\}$. Define $Q^*$ to be the distribution supported on $\{-M, M\}^d$ with probability
mass function defined by

$$Q^*(Z = Mz \mid X = x) = \begin{cases} q_{k^*}^+ & \text{if } \langle z, x \rangle > k^*, \\ q_{k^*}^- & \text{if } \langle z, x \rangle \le k^* \end{cases} \tag{17}$$

for $z, x \in \{-1, 1\}^d$. (For $X \notin \{-1, 1\}^d$, define $X'$ to be randomly chosen from $\{-1, 1\}^d$ such that
$\mathbb{E}[X' \mid X] = X$, then sample $Z$ according to the above p.m.f.)

If $k^*$ is unique, then $Q^*$ uniquely satisfies Definition 2, optimal local differential privacy. If $k^*$
is non-unique, any distribution satisfying optimal local differential privacy is in the set of all convex
combinations of distributions $Q^*$ defined via (17) for $k$ minimizing $q_k^+ / q_k^-$.

The proof of Proposition [3] is technical, and we defer it to Section [0.41 We make a few remarks, 
however. First, we provide a simplified explanation of the linear equations in the proposition.
By symmetry, no matter the value of $x \in \{-1, 1\}^d$ chosen, the same $q_k^+$ and $q_k^-$ solve the linear
equations. Proposition 3 shows the structure of the distribution attaining optimal local differential
privacy. That is, the proposition shows that the distribution $Q^*(\cdot \mid x)$ assigns mass only on the
points $z \in \{-M, M\}^d$, and moreover, it assigns one of two masses: either $q_k^+$ or $q_k^-$. Whether a
point $z \in \{-M, M\}^d$ is assigned the higher or lower mass depends on its agreement with the initial
point $x$ being perturbed, that is, whether $\langle z, x \rangle > kM$ or $\langle z, x \rangle \le kM$.

Thus, for a fixed level $k$, the amount of differential privacy $\alpha$ attained is given by $e^{\alpha} = q_k^+ / q_k^-$,
so that to find the most differentially private distribution, we calculate the minimizing $k$ for $q_k^+ / q_k^-$.
Note also that $q_k^+$ is a non-decreasing function of $k$, and that while the minimizing $k$ may be
non-unique, if $a/b = c/d$, then we have for any $\lambda \in [0, 1]$ that $a/c = b/d$, so

$$\frac{\lambda a + (1 - \lambda)c}{\lambda b + (1 - \lambda)d} = \frac{\lambda bc/d + (1 - \lambda)c}{\lambda ad/c + (1 - \lambda)d} = \frac{c(\lambda b/d + (1 - \lambda))}{d(\lambda a/c + (1 - \lambda))} = \frac{c}{d} = \frac{a}{b}.$$

In particular, the convex combination of $\alpha$-differentially private distributions from Proposition 3 is
precisely $\alpha$-differentially private. So Proposition 3 gives a mechanical way to compute the possible
set of distributions satisfying optimal local differential privacy.

5 Proofs of Statistical Rates 

In this section, we prove Theorems 1, 2, and 3 as well as Corollaries 1, 2, and 3. Our proofs build on
classical information-theoretic techniques from statistical minimax theory [45, 46] as well as some
intermediate results due to Agarwal et al. [1]. At a high level, our approach is as follows. Beginning
with an appropriately chosen finite set $\mathcal{V}$, we assign a risk function $R_v$ to each member $v \in \mathcal{V}$. The
resulting collection $\{R_v\}_{v \in \mathcal{V}}$ of risk functions is chosen so that they "separate" points in the set $\mathcal{V}$,
meaning that if $\theta \in \Theta$ is a point that approximately minimizes the function $R_v$, then for any $w \ne v$,
the point $\theta$ cannot also be an approximate minimizer of $R_w$. This separation property allows us to
deduce that statistical estimation implies the existence of a testing procedure that distinguishes $v$
from $w \ne v$. We then use Fano's inequality to obtain a lower bound on the testing error, so that the
final step is to obtain good upper bounds on the mutual information between the random variable
$X_i$ and the vector $Z_i$ communicated.
5.1 Reduction to testing

We begin by describing the reduction from bounding the minimax error to a testing problem. It
assumes a given collection of risk functions $\{R_v\}_{v \in \mathcal{V}}$ indexed by a finite set $\mathcal{V}$; see Section 5.2 to
follow for discussion of the particular collections used in our analysis. For each $v \in \mathcal{V}$, we choose
some representative $\theta_v^* \in \mathop{\mathrm{argmin}}_{\theta \in \Theta} R_v(\theta)$ of the set of all minimizing vectors. Our reduction is based
on a discrepancy measure between pairs of risk functions, first introduced by Agarwal et al. [1],
defined as

$$\rho(R_v, R_w) := \inf_{\theta \in \Theta} \left[ R_v(\theta) + R_w(\theta) - R_v(\theta_v^*) - R_w(\theta_w^*) \right]. \tag{18}$$

The $\rho$-separation of the set $\mathcal{V}$ is given by

$$\rho^*(\mathcal{V}) := \min\{\rho(R_v, R_w) : v, w \in \mathcal{V},\ v \ne w\}. \tag{19}$$

When the set $\mathcal{V}$ is clear from context, we use $\rho^*$ as shorthand for this separation. The key to the
definition (19) is that the separation allows us to lower bound the expected optimality gap of a
statistical method $\mathcal{M}$ by the probability of error in a hypothesis test. First, note that for any
$\theta \in \Theta$, there is at most one $v \in \mathcal{V}$ such that $R_v(\theta) - R_v(\theta_v^*) < \rho^*/2$. Indeed, if this inequality holds
for both $v$ and $w \ne v$,

$$\rho^*(\mathcal{V}) \le R_v(\theta) + R_w(\theta) - R_v(\theta_v^*) - R_w(\theta_w^*) < \rho^*(\mathcal{V}),$$

a contradiction. The following result is a variant of Lemma 2 from Agarwal et al. [1]:

Lemma 1. Let $P$ be a joint distribution over $X$ and $V \in \mathcal{V}$ such that $X$ are i.i.d. given $V$
and

$$\mathbb{E}_P[\ell(X, \theta) \mid V = v] = R_v(\theta).$$

Let $Q$ be the conditional distribution of $Z$ given the subgradients $\partial \ell(X, \cdot)$. For any minimization
procedure $\mathcal{M}$, one may construct a hypothesis test $\hat{v}(\mathcal{M}) : (Z_1, \ldots, Z_n) \to \mathcal{V}$ such that

$$\mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge \frac{\rho^*(\mathcal{V})}{2}\, \mathbb{P}_{P,Q}\left[\hat{v} \ne V\right].$$

In particular, if we can bound the probability of error of any hypothesis test for identifying $V$ based
on stochastic subgradient samples $Z_1, \ldots, Z_n$, then we have lower bounded the rate at which it is
possible to minimize the risk $R$.

In order to prove a lower bound on the error of a hypothesis testing problem, we apply Fano's
inequality. Let $V \in \mathcal{V}$ be chosen uniformly at random from $\mathcal{V}$. If a procedure observes random
variables $Z_1, \ldots, Z_n$, Fano's inequality ensures that for any estimate $\hat{v}$ of $V$ (that is, any measurable
function $\hat{v}$ of $Z_1, \ldots, Z_n$) the test error probability satisfies the lower bound

$$\mathbb{P}\left(\hat{v}(Z_1, \ldots, Z_n) \ne V\right) \ge 1 - \frac{I(Z_1, \ldots, Z_n; V) + \log 2}{\log |\mathcal{V}|}. \tag{20}$$
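The bound (20) is simple to evaluate once a mutual information bound is available. The helper below is a minimal sketch with a hypothetical name; it assumes the mutual information is given in nats and clips the result at zero, since a negative lower bound on a probability carries no information (the clipping is a convenience added here, not part of the inequality).

```python
import math


def fano_error_lower_bound(mutual_information, num_hypotheses):
    """Lower bound on the test error from Fano's inequality (20).

    mutual_information: an upper bound on I(Z_1, ..., Z_n; V), in nats.
    num_hypotheses: the cardinality |V| of the hypothesis set (at least 2).
    """
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    bound = 1.0 - (mutual_information + math.log(2.0)) / math.log(num_hypotheses)
    return max(0.0, bound)
```

For instance, with $|\mathcal{V}| = 4$ and zero mutual information the bound equals $1 - \log 2 / \log 4 = 1/2$.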

Using the lower bound provided by Lemma 1 and Fano's inequality (20), the structure of our
remaining proofs becomes more apparent. Each lower bound argument proceeds in three steps:

(1) We construct a collection of loss functions satisfying Definition 3, computing the minimal
separation (19) so that we may apply Lemma 1. (See Sections 5.2.1 through 5.2.3.)

(2) To be able to apply Fano's inequality (20), we provide an upper bound on the mutual information
$I(Z_1, \ldots, Z_n; V)$ for our specific choice of loss from step 1. To do so, we use the fact
that for each of Theorems 1, 2, and 3, we used a distribution $Q$ that satisfies our Definition 1
of optimal local privacy; this requires some subtlety in providing the bound. (See Lemmas 5,
6, 7, and 8 in Section 5.3.)

(3) The final step is to use the results of steps 1 and 2 in the application of Lemma 1 and Fano's
inequality (20). This then yields the theorems.

We provide the formal proofs of Theorems 1, 2, and 3 in Sections 5.4, 5.5, and 5.6, respectively; the
next two sections are devoted to steps 1 and 2.



5.2 Collections of loss functions 

In this section, we construct three example sets of functions, each yielding a different collection of
risks, enumerating their separation properties to be able to apply Lemma 1.



5.2.1 Linear Losses 

Our first collection of risk functionals is relatively simple, based on families of linear loss functions.
Assuming that the random variables $X$ take values in $\mathbb{R}^d$, we define the linear loss functions

$$\ell(x, \theta) := \langle x, \theta \rangle = \sum_{j=1}^d x_j \theta_j. \tag{21}$$

For this collection of loss functions, we let $\mathcal{V} = \{\pm e_i\}_{i=1}^d$, where the vectors $e_i$ are the standard
basis vectors in $\mathbb{R}^d$, whence $|\mathcal{V}| = 2d$. We also fix a $\delta \in (0, 1/4]$, which we specify later, and choose
the distribution $P$ on $X$ so that the final risk is equal to

$$R_v(\theta) = \mathbb{E}_P[\langle \theta, X \rangle] = \frac{c\delta}{d} \langle v, \theta \rangle. \tag{22}$$

We choose the constant $c$ so that the linear loss functions (21) belong to the appropriate loss class.

To construct a risk of the form (22), we draw the random vector $X$ conditional on the
parameter $v$, choosing $X$ from among the $2^d$ vectors in the scaled hypercube $\{-c/d, c/d\}^d$, viz.

$$\text{Choose } X \in \{-c/d, c/d\}^d \text{ with independent coordinates, where } X_j = \begin{cases} c/d & \text{w.p. } \frac{1 + \delta v_j}{2}, \\ -c/d & \text{w.p. } \frac{1 - \delta v_j}{2}. \end{cases} \tag{23}$$

Under the sampling strategy (23), when $v = \pm e_i$, the coordinate $X_j$ is independent and uniformly
chosen from $\{-c/d, c/d\}$ for $j \ne i$. Additionally, we have that $\mathbb{E}[\ell(X, \theta)] = R_v(\theta)$, and moreover:



Lemma 2. In the sampling scheme (23), with $c = Ld$:

(a) The loss (21) is $L$-Lipschitz with respect to the $\ell_\infty$-norm.

(b) For the optimization domain $\Theta = \{\theta : \|\theta\|_1 \le r\}$, the $\rho$-separation of the set $\mathcal{V} = \{\pm e_i\}_{i=1}^d$
is $\rho^*(\mathcal{V}) = Lr\delta$.

Proof The first statement of the lemma is immediate. For the second, we compute minimizers.
Indeed, by definition of the dual norm, we see that for $v \in \mathcal{V}$,

$$\inf_{\|\theta\|_1 \le r} R_v(\theta) = \frac{c\delta}{d} \inf_{\|\theta\|_1 \le r} \langle v, \theta \rangle = -\frac{c\delta}{d}\, r \|v\|_\infty = -L\delta r,$$

and the minimizer is uniquely attained at $\theta_v^* = -rv$. Then we have for any $w \ne v$ that

$$\inf_{\|\theta\|_1 \le 1} \left[\langle v + w, \theta \rangle\right] + \|v\|_\infty + \|w\|_\infty = -\|v + w\|_\infty + \|v\|_\infty + \|w\|_\infty \ge -1 + 1 + 1 = 1,$$

since no identical coordinates of $v$ and $w$ have the same sign. Multiplying the result by $Lr\delta$ completes
the proof. □
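The separation claimed in Lemma 2(b) can be checked by direct enumeration. The sketch below is an illustration under the assumption $c = Ld$, so that $R_v(\theta) = L\delta \langle v, \theta \rangle$; it uses the dual-norm identity $\inf_{\|\theta\|_1 \le r} \langle u, \theta \rangle = -r\|u\|_\infty$ from the proof, and the function names are hypothetical.

```python
def signed_basis_vectors(d):
    """The 2d vectors +e_i and -e_i in R^d, as tuples."""
    vectors = []
    for i in range(d):
        for sign in (1, -1):
            v = [0] * d
            v[i] = sign
            vectors.append(tuple(v))
    return vectors


def linear_risk_infimum(u, scale, r):
    """inf over ||theta||_1 <= r of scale * <u, theta> = -scale * r * ||u||_inf."""
    return -scale * r * max(abs(c) for c in u)


def rho_separation_linear(d, L, r, delta):
    """Minimum discrepancy (18) over distinct pairs in {+-e_i}, for d >= 2."""
    scale = L * delta
    hypotheses = signed_basis_vectors(d)
    best = float("inf")
    for v in hypotheses:
        for w in hypotheses:
            if v == w:
                continue
            v_plus_w = tuple(a + b for a, b in zip(v, w))
            rho = (linear_risk_infimum(v_plus_w, scale, r)
                   - linear_risk_infimum(v, scale, r)
                   - linear_risk_infimum(w, scale, r))
            best = min(best, rho)
    return best
```

The pairs $w = -v$ give the larger value $2Lr\delta$, so the minimum is attained by pairs supported on different coordinates.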



5.2.2 Hinge (SVM) Losses 

We now turn to families of losses that are useful for analyzing the case of stochastic subgradients
bounded in $\ell_1$-norm. Let $\mathcal{V} \subset \{-1, 1\}^d$ be a subset of the binary hypercube such that for all distinct
pairs $v \ne v'$, we have $\|v - v'\|_1 \ge d/2$, or equivalently $\|v - v'\|_0 \ge d/4$. From the Gilbert-Varshamov
bound (e.g., [46, Lemma 4]) there are sets of this form with cardinality at least $\mathrm{card}(\mathcal{V}) \ge \exp(d/8)$.
For a fixed constant $c > 0$, we define the hinge loss

$$\ell(x, \theta) = c\left[r - \langle x, \theta \rangle\right]_+. \tag{24}$$

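As our sampling process for the data, we choose $X$ from among the $2d$ positive and negative
standard basis vectors $\pm e_j$, namely

$$X = \begin{cases} e_j & \text{w.p. } \frac{1 + \delta v_j}{2d}, \\ -e_j & \text{w.p. } \frac{1 - \delta v_j}{2d}, \end{cases} \tag{25}$$

where $\delta \in (0, 1/4]$ is fixed. The combination of hinge loss (24) and sampling strategy (25) yields
the risk function

$$R_v(\theta) := \frac{c}{d} \sum_{j=1}^d \left( \frac{1 + \delta v_j}{2} \left[r - \theta_j\right]_+ + \frac{1 - \delta v_j}{2} \left[r + \theta_j\right]_+ \right). \tag{26}$$

Assuming that $\Theta$ contains the $\ell_\infty$ ball of radius $r$, the (unique) minimizer of the risk over $\Theta$ is

$$\theta_v^* := \mathop{\mathrm{argmin}}_{\theta \in \Theta} R_v(\theta) = rv \in r\{-1, 1\}^d \subset \Theta.$$

Moreover, this risk has the following properties: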
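Lemma 3. For any set $\Theta \supseteq [-r, r]^d$, we have:

(a) For $P$ with support $\mathrm{supp}\, P \subset \{x \in \mathbb{R}^d : \|x\|_1 \le 1\}$, the loss function $\ell(x, \theta) = c[r - \langle \theta, x \rangle]_+$ is
$c$-Lipschitz with respect to the $\ell_1$-norm.

(b) If $v, w \in \mathcal{V}$ with $v \ne w$, the discrepancy $\rho(R_v, R_w) \ge rc\delta/2$.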
Proof The first claim is immediate (e.g., [24]), since $\|\partial \ell(x, \theta)\|_1 \le c\|x\|_1 \le c$. For the second
statement of the lemma, we see that the minimum of $R_v(\theta) + R_w(\theta)$
is attained by any $\theta \in \Theta$ with $\theta_j \in [-r, r]$ for $j$ such that $v_j \ne w_j$ and $\theta_j = r v_j$ for $j$ such that
$v_j = w_j$. Thus we have

$$\begin{aligned}
\inf_{\theta \in \Theta} \{R_v(\theta) + R_w(\theta)\} - R_v(\theta_v^*) - R_w(\theta_w^*)
&= \frac{2c}{d} \sum_{j=1}^d r - \frac{2c}{d} \sum_{j : v_j = w_j} r\delta - cr(1 - \delta) - cr(1 - \delta) \\
&= 2cr - 2cr + 2cr\delta - \frac{2cr\delta}{d} \left(d - \|v - w\|_0\right) \\
&= \frac{2cr\delta}{d} \|v - w\|_0.
\end{aligned}$$

Since $\|v - w\|_0 \ge d/4$ by construction, we have $\rho(R_v, R_w) \ge rc\delta/2$, as desired. □
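The discrepancy computed in the proof of Lemma 3 can be verified numerically for small examples. The sketch below assumes the coordinate-separable form of the risk (26), and uses the fact that a convex piecewise-linear function of $\theta_j$ with kinks only at $\pm r$ attains its minimum at one of the kinks. All names are hypothetical.

```python
def hinge_risk_term(theta_j, v_j, r, delta):
    """One coordinate of the risk (26), without the leading factor c/d."""
    pos = max(0.0, r - theta_j)
    neg = max(0.0, r + theta_j)
    return 0.5 * (1 + delta * v_j) * pos + 0.5 * (1 - delta * v_j) * neg


def hinge_discrepancy(v, w, c, r, delta):
    """The discrepancy rho(R_v, R_w) of (18) for the hinge risks (26)."""
    d = len(v)
    kinks = (-r, r)  # candidate minimizers for each coordinate
    total = 0.0
    for v_j, w_j in zip(v, w):
        joint = min(hinge_risk_term(t, v_j, r, delta)
                    + hinge_risk_term(t, w_j, r, delta) for t in kinks)
        best_v = min(hinge_risk_term(t, v_j, r, delta) for t in kinks)
        best_w = min(hinge_risk_term(t, w_j, r, delta) for t in kinks)
        total += joint - best_v - best_w
    return c * total / d
```

For $v$ and $w$ differing in two of four coordinates with $c = 2$, $r = 1.5$, and $\delta = 0.2$, the closed form $\frac{2cr\delta}{d}\|v - w\|_0$ gives $0.6$.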



5.2.3 Median-type Losses 

We now describe a class of median-type losses, one with more general applicability than the linear
losses of Section 5.2.1. As in Section 5.2.2, let $\mathcal{V} \subset \{-1, 1\}^d$ be a $d/4$-packing of the hypercube in
$\ell_0$-norm. For a given $\delta \in (0, 1/4]$, define the risk function

$$R_v(\theta) := \frac{c}{d} \sum_{j=1}^d \left( \frac{1 + \delta v_j}{2} |\theta_j - r| + \frac{1 - \delta v_j}{2} |\theta_j + r| \right) = \frac{c}{d} \left[ \frac{1 + \delta}{2} \|\theta - rv\|_1 + \frac{1 - \delta}{2} \|\theta + rv\|_1 \right]. \tag{27}$$

By construction, whenever $\Theta$ contains the $\ell_\infty$ ball of radius $r$, this risk function has the unique
minimizer

$$\theta_v^* := \mathop{\mathrm{argmin}}_{\theta \in \Theta} R_v(\theta) = rv \in r\{-1, 1\}^d \subset \Theta.$$

The risk (27) can be realized as the expectation of the median loss function

$$\ell(X, \theta) = \frac{c}{d} \|X - \theta\|_1, \tag{28a}$$

and a sampling scheme of the form

$$X_j = \begin{cases} r & \text{w.p. } \frac{1 + \delta v_j}{2}, \\ -r & \text{w.p. } \frac{1 - \delta v_j}{2}. \end{cases} \tag{28b}$$

With these choices, it is straightforward to verify that $R_v(\theta) = \mathbb{E}[\ell(X, \theta)]$. The following lemma,
due to Agarwal et al. [1], captures the separation properties of the collection $\{R_v\}_{v \in \mathcal{V}}$ of risk
functionals:

Lemma 4. Assume that $\Theta$ contains $[-r, r]^d$ and let $R_v$ be defined by the risk (27). If $v, w \in \mathcal{V}$
with $v \ne w$, the discrepancy $\rho(R_v, R_w) \ge rc\delta/2$.

As a final remark, for random variables $X \in \mathbb{R}^d$, the loss function (28a) is Lipschitz continuous
(for appropriate choice of $c$) for any distribution $P$ on $X$. Specifically, defining the $\mathrm{sign}(\cdot)$ function
coordinate-wise, we have the subgradient equality $\partial \ell(x, \theta) = (c/d)\, \mathrm{sign}(\theta - x)$. Thus, choosing $p, q$
to satisfy $1/q + 1/p = 1$ and $c = L d^{1/q}$ yields a member of the collection of $(L, p)$-loss functions.
5.3 Mutual information bounds and hypothesis testing

We now return to the main thread of the proof. From inspection of Fano's inequality (20), we see
that it involves two quantities: (i) the log cardinality $\log |\mathcal{V}|$, and (ii) suitable upper bounds on the
mutual information term. Since the log cardinality was specified during construction of our loss
families in the preceding section, it remains to address the latter sub-problem.

Recall that $Z_1, \ldots, Z_n$ are unbiased subgradient estimates of the loss $\theta \mapsto \ell(X_i, \theta)$, where $X_i$
are independent samples according to a distribution $P(\cdot \mid V)$. We assume that the samples $Z_i$ are
conditionally independent of $V$ given $X_i$ and the parameters $\theta$; this assumption is natural since $Z$
is a random function of $\partial \ell(X_i, \theta)$. Our goal is to upper bound the mutual information between the
sequence $Z_1, \ldots, Z_n$ of observed (stochastic) gradients and the random element $V \in \mathcal{V}$.

From Propositions 1 and 2, we know that the channel distributions $Q$ guaranteeing privacy
are supported on a finite set: in the case of $p = 1$, on (a multiple of) the standard basis vectors
$\{\pm e_i\}_{i=1}^d$, and for $p = \infty$, on (a multiple of) the corners of the hypercube $\{-1, 1\}^d$. Thus (using
the chain rule for mutual information [3]) we have the decomposition

$$I(Z_1, \ldots, Z_n; V) = \sum_{i=1}^n \left[ H(Z_i \mid Z_1, \ldots, Z_{i-1}) - H(Z_i \mid V, Z_1, \ldots, Z_{i-1}) \right].$$

Let $\theta_i$ denote the point at which the $i$th gradient is computed. Then by inspection, we must
have $\theta_i \in \sigma(Z_1, \ldots, Z_{i-1})$. Since $Z_i$ is conditionally independent of $Z_1, \ldots, Z_{i-1}$ given $V$ and $\theta_i$
and conditioning decreases entropy, we have

$$\begin{aligned}
H(Z_i \mid Z_1, \ldots, Z_{i-1}) - H(Z_i \mid V, Z_1, \ldots, Z_{i-1}) &= H(Z_i \mid Z_1, \ldots, Z_{i-1}) - H(Z_i \mid V, \theta_i) \\
&\le H(Z_i \mid \theta_i) - H(Z_i \mid V, \theta_i) \\
&= I(Z_i; V \mid \theta_i).
\end{aligned}$$

In particular, letting $F_i$ denote the distribution of $\theta_i$, we have

$$I(Z_1, \ldots, Z_n; V) \le \sum_{i=1}^n \int I(Z_i; V \mid \theta)\, dF_i(\theta) \le \sum_{i=1}^n \sup_{\theta \in \Theta} I(Z_i; V \mid \theta). \tag{29}$$

We now state four lemmas, each bounding the mutual information between observed subgradients
$Z_i$ and the random variable $V$, for different choices of loss function $\ell$ and conditional
distribution $Q$. The proof of each lemma begins by using the bound (29) to reduce the problem to
estimating the mutual information $I(Z; V \mid \theta)$ for a single randomized gradient sample $Z$. Then,
careful calculation of the distribution of $Z \mid V$ yields the final inequalities. As the proofs are
somewhat long and technical, we defer them to Appendix B.

Lemma 5. Let $V$ be drawn uniformly at random from $\mathcal{V} = \{\pm e_i\}_{i=1}^d$. Let $X$ have the distribution
(23) conditional on $V = v$ and assume $\ell(X, \theta) = \langle X, \theta \rangle$. Let $Z$ be constructed according
to the conditional distribution specified by Proposition 1, given a subgradient $\partial \ell(X_i; \theta)$ with
$Z \in [-M_\infty, M_\infty]^d$, where $M_\infty \ge c/d$. Then

/(Zi,...,Z„;F)<n^. 

See Appendix B.1 for a proof of Lemma 5.

Lemma 6. Let $V$ be drawn uniformly at random from a set $\mathcal{V} \subset \{-1, 1\}^d$. Define the distribution
$P(\cdot \mid V)$ on $X$ to be such that the $j$th coordinate $X_j = rV_j$ with probability $(1 + \delta)/2$ and $X_j = -rV_j$
with probability $(1 - \delta)/2$, each coordinate independent of the others, where $r > 0$ is a constant. Let
the loss function $\ell(X, \theta)$ be given as follows:

$$\ell(X, \theta) = \frac{c}{d} \|\theta - X\|_1.$$

Let $Z$ be constructed according to the distribution specified by Proposition 1 conditional on a
subgradient $\partial \ell(X_i; \theta)$, where $Z \in [-M_\infty, M_\infty]^d$ and $M_\infty \ge c/d$. Then



2^2 



I{Zi,...,Zn;V)<n- 



^Mld- 

See Appendix IB. 21 for a proof of Lemma El 

Lemma 7. Let $V$ be drawn uniformly at random from a set $\mathcal{V} \subset \{-1, 1\}^d$. Define the distribution $P(\cdot \mid V)$ on $X$ as in the random sampling scheme (25) and use the loss (24). Let $Z$ be constructed according to the conditional distribution specified by Proposition 2, where $Z \in \{z \in \mathbb{R}^d : \|z\|_1 \le M_1\}$. Define $M = M_1/c$ and the constants

$$\gamma := \log\left(\frac{2d - 2 + \sqrt{(2d-2)^2 + 4(M^2 - 1)}}{2(M - 1)}\right) \quad \text{and} \quad \Delta(\gamma) := \frac{e^\gamma - e^{-\gamma}}{e^\gamma + e^{-\gamma} + 2(d - 1)}.$$

Then

$$I(Z_1, \ldots, Z_n; V) \le n \delta^2 \Delta(\gamma)^2.$$

We provide the proof of the lemma in Appendix B.3.

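Lemma 8. Let $V$ be drawn uniformly at random from $\mathcal{V} = \{\pm e_j\}_{j=1}^d$. Let $X \mid V$ be sampled according to the distribution (23), and let $Z \mid X = x$ have support on $\{-1, 1\}^d$ and have p.m.f.

$$q(z \mid x) \propto \begin{cases} \exp(\alpha) & \text{if } \langle z, x \rangle > k \\ 1 & \text{if } \langle z, x \rangle \le k \end{cases}$$

for some $k \ge 0$. Define the constants $C_d(k)$ and $\Delta(\delta, \alpha, d, k)$ by

$$C_d(k) := \mathrm{card}\{z \in \{-1, 1\}^d : \langle z, x \rangle > k\} = \sum_{i=0}^{\lceil (d-k)/2 \rceil - 1} \binom{d}{i}$$

and

$$\Delta(\delta, \alpha, d, k) := \delta \, \frac{(e^\alpha - 1) \binom{d-1}{\lceil (d-k)/2 \rceil - 1}}{(e^\alpha - 1) C_d(k) + 2^d}.$$

Then

$$I(Z; V) \le \Delta(\delta, \alpha, d, k)^2.$$

We provide the proof of the lemma in Appendix B.4.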



In the proof of Theorems 1 and 2, we require one additional result for the cases when the
dimension d is small; we apply the result instead of Fano's inequality. Specifically, we use Le Cam's 
method S^, 3], which provides lower bounds on the probability of error in binary hypothesis testing 
problems. In this setting, assume that $\mathcal{V} = \{-1, 1\}$ has two elements, and let $V \in \mathcal{V}$ be chosen uniformly at random from $\mathcal{V}$. If a procedure observes random variables $Z_1, \ldots, Z_n$ distributed according to $Q_1^n$ if $V = 1$ and $Q_{-1}^n$ if $V = -1$, then any estimate $\hat{v}$ of $V$ satisfies the lower bound

$$\mathbb{P}\left(\hat{v}(Z_1, \ldots, Z_n) \ne V\right) \ge \frac{1}{2} - \frac{1}{2} \left\| Q_1^n - Q_{-1}^n \right\|_{\rm TV}. \tag{30}$$

See, for example, Yu [46, Lemma 1] and Le Cam [30, Section 2]. Moreover, we have the following lemma.

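Lemma 9. Let $Q_1$ and $Q_{-1}$ be distributions on $\{-1, 1\}$, where

$$Q_1(Z = z) = \frac{1}{2} + \frac{1}{2} \cdot \begin{cases} \delta & \text{if } z = 1 \\ -\delta & \text{otherwise} \end{cases} \quad \text{and} \quad Q_{-1}(Z = z) = \frac{1}{2} + \frac{1}{2} \cdot \begin{cases} -\delta & \text{if } z = 1 \\ \delta & \text{otherwise.} \end{cases}$$

Let $Q_v^n$ denote the $n$-fold product distribution of $Q_v$. Then for $\delta \in [0, 1/3]$,

$$\left\| Q_1^n - Q_{-1}^n \right\|_{\rm TV} \le \delta \sqrt{(3/2) n}.$$

We provide the proof of the lemma in Appendix B.5.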

Equipped with these auxiliary results, we are now ready to prove our main theorems. 

5.4 Proof of Theorem 1

We break the proof of Theorem 1 into three parts. In the first, we prove part (a) of the theorem assuming that the dimension $d \ge 9$. Next, we show part (a) for smaller values of the dimension, which requires Le Cam's bounding technique (30). Finally, we prove part (b). Roughly, our strategy is to apply Lemma 1 and one of Lemmas 5 or 6 to achieve a lower bound on the rate of convergence of any estimation procedure. We first recall the beginning of the previous section, stating the following application of Lemma 1 and Fano's inequality (20):

$$\frac{2}{\rho^*(\mathcal{V})} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge \mathbb{P}_{P,Q}\left(\hat{v}(\mathcal{M}) \ne V\right) \ge 1 - \frac{I(Z_1, \ldots, Z_n; V) + \log 2}{\log |\mathcal{V}|}. \tag{31}$$

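Now we give the proof of the first statement of the theorem in the case that $d \ge 9$. Applying Lemmas 3 and 6, we immediately have the following specialization of the inequality (31):

$$\frac{4}{c r \delta} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge 1 - \frac{\log 2}{\log |\mathcal{V}|} - n \frac{\delta^2 c^2}{M_\infty^2 d \log |\mathcal{V}|}.$$

Taking the set $\mathcal{V} \subset \{-1, 1\}^d$ to be a $d/4$ packing of the hypercube $\{-1, 1\}^d$ satisfying $|\mathcal{V}| \ge \exp(d/8)$, as described in Sections 5.2.2 and 5.2.3, we see that

$$\frac{4}{c r \delta} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge 1 - \frac{8 \log 2}{d} - 8n \frac{\delta^2 c^2}{M_\infty^2 d^2}.$$

By the remarks following Lemma 3 we may take $L = c/d$. The numerical inequality $8 \log 2 < 6$ coupled with the preceding bound implies

$$\frac{4}{L r d \delta} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge 1 - \frac{6}{d} - 8n \frac{\delta^2 L^2}{M_\infty^2}.$$

By our assumption that $d \ge 9$, if we choose $\delta = M_\infty / (8L\sqrt{n})$, then we are guaranteed the lower bound $\frac{4}{L r d \delta} \mathbb{E}_{P,Q}[\epsilon_n(\mathcal{M}, \ell, \Theta, P)] \ge \frac{1}{5}$, or equivalently

$$\mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge \frac{1}{20} L r d \delta = \frac{1}{160} \frac{M_\infty r d}{\sqrt{n}}.$$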

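When $d < 9$, we may reduce to the case that $d = 1$, since a lower bound in this setting extends to higher dimensions (though we may lose dimension dependence). For this case, we use the packing set $\mathcal{V} = \{-1, 1\}$ with the linear loss function from Lemma 2, which has $\rho^*(\mathcal{V}) = L r \delta$. In this case, the marginal distribution $Q(\cdot \mid V)$ is given by

$$Q(Z = z \mid V = 1) = \frac{1}{2} + \frac{1}{2} \cdot \begin{cases} \frac{\delta L}{M} & \text{if } z = M \\ -\frac{\delta L}{M} & \text{otherwise, i.e. if } z = -M. \end{cases}$$

Now, let $Q^n(\cdot \mid V)$ denote the distribution of $Z_1, \ldots, Z_n$ conditional on $V$. Then applying Lemma 1 and Le Cam's lower bound (30), we obtain the inequality

$$\frac{2}{r L \delta} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge \mathbb{P}_{P,Q}\left(\hat{v}(\mathcal{M}) \ne V\right) \ge \frac{1}{2} - \frac{1}{2} \left\| Q^n(\cdot \mid V = 1) - Q^n(\cdot \mid V = -1) \right\|_{\rm TV}.$$

By inspection, the distributions place us precisely in the conditions Lemma 9 specifies, so if $\delta \le M/(3L)$, we have the bound

$$\frac{2}{r L \delta} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge \frac{1}{2} - \frac{\delta L \sqrt{3n}}{2\sqrt{2} M}. \tag{32}$$

Multiplying both sides by $r L \delta / 2$, then setting $\delta = M/(3L\sqrt{n}) \le M/(3L)$, we have

$$\mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge \frac{(3\sqrt{2} - \sqrt{3}) M r}{36 \sqrt{2} \sqrt{n}} \ge \frac{M r}{20.3 \sqrt{n}}.$$

In turn, for any $d \le 8$, we immediately find that $1/20.3 \ge d/163$, which completes the proof of Theorem 1(a).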
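
The numerical constants used in this part of the proof can be checked mechanically. The following Python sketch is not part of the original argument: it assumes that the Fano-type right-hand side has the form $1 - 6/d - 8n\delta^2 L^2/M_\infty^2$, and the helper names `fano_rhs` and `constants_hold` are hypothetical.

```python
import math


def fano_rhs(d, n, L, m_inf):
    """Assumed right-hand side 1 - 6/d - 8 n delta^2 L^2 / M^2,
    evaluated at the choice delta = M / (8 L sqrt(n))."""
    delta = m_inf / (8.0 * L * math.sqrt(n))
    return 1.0 - 6.0 / d - 8.0 * n * delta ** 2 * L ** 2 / m_inf ** 2


def constants_hold(max_d=200):
    """Check the three numerical facts used in the proof of part (a)."""
    # The inequality 8 log 2 < 6.
    ok = 8.0 * math.log(2.0) < 6.0
    # For d >= 9 the right-hand side is at least 1/5.
    ok = ok and all(
        fano_rhs(d, n=1000, L=1.0, m_inf=2.0) >= 0.2 for d in range(9, max_d + 1)
    )
    # For d <= 8 we have 1/20.3 >= d/163.
    ok = ok and all(1.0 / 20.3 >= d / 163.0 for d in range(1, 9))
    return ok
```

With this choice of $\delta$ the last term equals $1/8$ regardless of $n$, $L$ and $M_\infty$, so the check depends only on $d$.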

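For the second statement of the theorem, we use the linear losses of Section 5.2.1 and apply Lemmas 2 and 5 with the choice $\mathcal{V} = \{\pm e_j\}_{j=1}^d$. Since we are in the $(L, \infty)$-Lipschitz class of loss functions, we take $c = Ld$ in the sampling scheme (23). In this case, the lower bound (31) and Lemma 2's separation guarantee imply that

$$\frac{2}{L r \delta} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge 1 - \frac{I(Z_1, \ldots, Z_n; V)}{\log(2d)} - \frac{\log 2}{\log(2d)}.$$

By assumption that $d \ge 2$, we have $\log 2 / \log(2d) \le 1/2$, which, after an application of Lemma 5, yields

$$\frac{2}{L r \delta} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge \frac{1}{2} - n \frac{\delta^2 L^2}{M_\infty^2 \log(2d)}.$$

If we choose $\delta = M_\infty \sqrt{\log(2d)} / (2L\sqrt{n})$, we see that we have

$$\frac{2}{L r \delta} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge \frac{1}{4},$$

which is equivalent in this case to

$$\mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge \frac{1}{8} r L \delta = \frac{1}{16} \frac{M_\infty r \sqrt{\log(2d)}}{\sqrt{n}}.$$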
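
5.5 Proof of Theorem 2

The proof of Theorem 2 is quite similar to that of Theorem 1, except that we apply Lemma 7 in place of Lemmas 5 or 6. Indeed, following identical steps to those in the proof of Theorem 1, we see that with the packing $\mathcal{V} \subset \{-1, 1\}^d$ of size $|\mathcal{V}| \ge \exp(d/8)$, we have

$$\frac{4}{r c \delta} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge 1 - \frac{\log 2}{\log |\mathcal{V}|} - \frac{n \delta^2 \Delta(\gamma)^2}{\log |\mathcal{V}|} \ge 1 - \frac{8 \log 2}{d} - \frac{8 n \delta^2 \Delta(\gamma)^2}{d}.$$

Consequently, if we choose $\delta = \sqrt{d} / (8 \Delta(\gamma) \sqrt{n})$, then for all $d \ge 9$, we have the lower bound $\frac{4}{r c \delta} \mathbb{E}_{P,Q}[\epsilon_n(\mathcal{M}, \ell, \Theta, P)] \ge \frac{1}{5}$, or equivalently

$$\mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge \frac{1}{20} r c \delta = \frac{1}{160} \frac{c r \sqrt{d}}{\sqrt{n} \, \Delta(\gamma)},$$

which completes the proof (as the case $d \le 8$ is identical to that in Theorem 1).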

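5.6 Proof of Theorem 3

Since our optimization domain $\Theta = \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le r\}$, we proceed similarly to our proof of Theorem 1 and use the linear losses of Section 5.2.1. Indeed, using the packing set $\mathcal{V} = \{\pm e_i\}_{i=1}^d$, we find that

$$\frac{2}{L r \delta} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge 1 - \frac{I(Z_1, \ldots, Z_n; V)}{\log(2d)} - \frac{\log 2}{\log(2d)}$$

as earlier. For any $\alpha \le 5/4$, we have $e^\alpha - 1 \le 2\alpha$, and by properties of binomial coefficients and Stirling's approximation we have

$$\frac{1}{2^d} \binom{d-1}{\lceil (d-k)/2 \rceil - 1} \le \frac{1}{2^d} \binom{d-1}{\lceil d/2 \rceil - 1} \le \frac{1}{\sqrt{d}}$$

for any $k$. Now, for any distribution $Q$ satisfying optimal local differential privacy at a differential privacy level $\alpha$, Proposition 3 implies $Q$ is a convex combination of distributions with p.m.f.s of the form in Lemma 8. Applying the convexity of mutual information (taking a convex combination of channel distributions $Q$ can only reduce mutual information) and Lemma 8, we thus obtain

$$\begin{aligned}
I(Z_1, \ldots, Z_n; V) &\le n \max_{k \ge 0} \Delta(\delta, \alpha, d, k)^2 \\
&\le n \delta^2 (e^\alpha - 1)^2 \max_k \left( \frac{\binom{d-1}{\lceil (d-k)/2 \rceil - 1}}{(e^\alpha - 1) C_d(k) + 2^d} \right)^2 \le 4 n \delta^2 \alpha^2 \cdot \frac{1}{4^d} \binom{d-1}{\lceil d/2 \rceil - 1}^2 \le \frac{4 n \delta^2 \alpha^2}{d}.
\end{aligned}$$

As a consequence, we have the lower bound

$$\frac{2}{L r \delta} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge \frac{1}{2} - \max_k \frac{n \Delta(\delta, \alpha, d, k)^2}{\log(2d)} \ge \frac{1}{2} - \frac{4 n \delta^2 \alpha^2}{d \log(2d)}.$$

By choosing $\delta = \sqrt{d \log(2d)} / (4 \alpha \sqrt{n})$, we find that

$$\frac{2}{L r \delta} \mathbb{E}_{P,Q}\left[\epsilon_n(\mathcal{M}, \ell, \Theta, P)\right] \ge \frac{1}{4},$$

which is equivalent to the bound given in the theorem.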



5.7 Proof of Corollary 1

Since $\Theta \subseteq \{\theta \in \mathbb{R}^d : \|\theta\|_1 \le r\}$, the bound (14a) guarantees that mirror descent obtains convergence rate $O(M_\infty r \sqrt{\log d} / \sqrt{n})$. This matches the second statement of Theorem 1. Now fix our desired amount of mutual information $I^*$. From the remarks following Proposition 1, if we must guarantee that $I^* \ge \sup_P I(P, Q)$ for any distribution $P$ and loss function $\ell$ whose gradients are bounded in $\ell_\infty$-norm by $L$, we must (because of the uniqueness of the optimal privacy distribution $Q$) have

Up to higher order terms, to guarantee a level of privacy with mutual information $I^*$, we must allow gradient noise up to a level $M_\infty = L\sqrt{d/I^*}$. The equality (33) establishes that for a given level of allowed mutual information $I^*$, if optimal local privacy holds, then we must have $M_\infty \asymp L\sqrt{d}/\sqrt{I^*}$. That is, we have a bijection between $I^*$ and $M_\infty$ whenever optimal local privacy holds, so substituting $M_\infty = L\sqrt{d}/\sqrt{I^*}$ into our upper and lower bounds gives the corollary. □



5.8 Proof of Corollary 2

According to the conditions of optimal local privacy, if we must guarantee that $I^* \ge \sup_P I(P, Q)$ for any loss function $\ell$ whose gradients are bounded in $\ell_1$-norm by $L$, we must have

$$I^* \approx \frac{d L^2}{2 M_1^2},$$

using Corollary 4 after the statement of Proposition 2. Rewriting this, we see that we must have $M_1 = L\sqrt{d/(2I^*)}$ (to higher order terms) to be able to guarantee an amount of privacy $I^*$. As in the $\ell_\infty$ case, we have a bijection between the multiplier $M_1$ and the amount of information $I^*$ and can apply similar techniques. Now recall the convergence guarantee (14b) provided by stochastic gradient descent. Since the $\ell_\infty$-ball of radius $r$ is contained in the $\ell_2$-ball of radius $r_2 = r\sqrt{d}$, and $\|g\|_2 \le \|g\|_1$ for all $g \in \mathbb{R}^d$, stochastic gradient descent guarantees that $\epsilon^*(\ell, \Theta) \le C M_1 r \sqrt{d} / \sqrt{n}$. Applying the lower bound provided by Theorem 2 and substituting for $M_1$ completes the proof. □



5.9 Proof of Corollary 3

Without loss of generality (by scaling), we assume that $L = 1$. Now we consider Proposition 3, which characterizes the distributions satisfying optimal local differential privacy. We use the proposition to find an upper bound on $M_\infty$ in terms of the differential privacy level $\alpha$, which in turn allows us to apply the bound from mirror descent (14a). Instead of directly using Proposition 3, it is simpler to use the linear program (52) in its proof, and note that finding a lower bound on $t$ (in the LP) as a function of $\alpha$ provides an upper bound on $M_\infty$, since $M_\infty = 1/t$. Now, in the linear program (52), we choose the values for $q(z)$ specified by Lemma 18. Let $q_+$ and $q_-$ denote the larger and smaller probabilities, respectively. Fix an $x \in \{-1, 1\}^d$, and let $z$ range over $\{-1, 1\}^d$. With those choices, we note that for $d$ odd,

$$\begin{aligned}
\sum_{z : \langle z, x \rangle > 0} z &= \sum_{z : \langle z, x \rangle = 1} z + \sum_{z : \langle z, x \rangle = 3} z + \cdots + \sum_{z : \langle z, x \rangle = d} z \\
&= \left[ \binom{d-1}{\frac{d-1}{2}} - \binom{d-1}{\frac{d+1}{2}} \right] x + \left[ \binom{d-1}{\frac{d+1}{2}} - \binom{d-1}{\frac{d+3}{2}} \right] x + \cdots = \binom{d-1}{\frac{d-1}{2}} x.
\end{aligned}$$

For $d$ even, a similar calculation yields $\sum_{z : \langle z, x \rangle > 0} z = \binom{d-1}{d/2} x$. As a consequence, we find that

$$\sum_z z \, q(z \mid x) = q_+ \sum_{z : \langle z, x \rangle > 0} z + q_- \sum_{z : \langle z, x \rangle \le 0} z = (q_+ - q_-) \cdot \begin{cases} \binom{d-1}{(d-1)/2} \, x & d \text{ odd} \\ \binom{d-1}{d/2} \, x & d \text{ even.} \end{cases}$$

Focusing on the odd case for simplicity (identical bounds hold in the even case), we have for a universal constant $c > 0$ that

$$(q_+ - q_-) \binom{d-1}{\frac{d-1}{2}} = \frac{1}{2^{d-1}} \binom{d-1}{\frac{d-1}{2}} \frac{e^\alpha - 1}{e^\alpha + 1} \ge c \frac{1}{\sqrt{d}} \frac{e^\alpha - 1}{e^\alpha + 1} \ge c \frac{\alpha}{\sqrt{d}},$$

the first inequality following from Stirling's approximation and the second from convexity of the function $\alpha \mapsto e^\alpha$. In particular, we see that the minimizing value $t$ in the linear program (52) will satisfy $t \ge c\alpha/\sqrt{d}$, which in turn yields $M_\infty = 1/t \le \sqrt{d}/(c\alpha)$. Noting that the lower bound in the corollary is given by Theorem 3, applying the convergence guarantee (14a) of mirror descent based on $M_\infty$ completes the proof. □



6 Discussion 

We have studied methods for protecting privacy in general statistical risk minimization problems, 
in particular techniques that maintain privacy between the data $X_1, \ldots, X_n$ and the estimation method $\mathcal{M}$. As a consequence of our focus, we were able to provide a general technique for
obtaining sharp tradeoffs between privacy protection and estimation rates, which are a natural 
measure of utility for statistical problems. 

We believe that there are a number of remaining open issues and areas for future work. First, 
we studied procedures that access each datum only once, and through a perturbed view $Z_i$ of the subgradient $\partial \ell(X_i, \theta)$, which allowed us to use (essentially) arbitrary convex losses. A natural question is whether there are restrictions of the class of loss functions so that a transformed version $(Z_1, \ldots, Z_n)$ of the data are sufficient for inference. For instance, Zhou et al. [47, 41] study applications in which a data matrix $X = [X_1 \cdots X_n]^\top \in \mathbb{R}^{n \times d}$ is pre-multiplied by a normal matrix $\Phi \in \mathbb{R}^{m \times n}$, where $m \ll n$, and statistical inference is performed using $\Phi X$. For problems such as linear regression and PCA, the resulting estimators enjoy good statistical properties. This transformation, however, cannot be computed without the entire dataset at one's disposal. Nonparametric data releases, such as those studied by Hall et al. [24], could provide insights here, though again,
current approaches require the data to be aggregated by a trusted curator before release. 

Our constraints on the privacy-inducing channel distribution Q require that its support lie 
in some compact set. We find this restriction useful, but perhaps it is possible to achieve faster estimation rates if all we require are moment conditions, for example, $\mathbb{E}_Q[\|Z - X\|^p \mid X] \le \kappa^p$. A better understanding of general privacy-preserving channels $Q$ for alternative constraints to those we have proposed is also desirable. Moreover, one might consider attempting only to guarantee that $\phi(X)$ is private, where $\phi$ is some (known) function. For example, members of a dataset may not care if their genders are known, but more personal features of $X$ may be more sensitive.

These questions do not appear to have easy answers, especially when we wish to allow each 
provider of a single datum to be able to guarantee his or her own privacy. Nevertheless, we hope 
that our view of privacy and the techniques we have developed herein prove fruitful, and we hope 
to investigate some of the above issues in future work.

A Unbiasedness

In this appendix, we show that if an optimization procedure receives biased subgradients it is possible to be arbitrarily wrong. In particular, we do so by constructing a simple problem instance. Fix a bias $b > 0$ and consider the following one dimensional problem:

$$\text{minimize } f(\theta) := \frac{b\theta}{2} \quad \text{subject to } \theta \in [-c, c].$$

If a gradient oracle returns biased gradients of the form $-b/2$ at each point $\theta \in [-c, c]$, it is impossible to distinguish the objective from $-b\theta/2$. The minimizer of this objective is $\theta_{\rm bias} = \mathrm{sign}(b) c$. The true optimal point is $\theta^* = -\mathrm{sign}(b) c$, yielding the worst possible error

$$f(\theta_{\rm bias}) - f(\theta^*) = \sup_{\theta \in [-c, c]} f(\theta) - \inf_{\theta \in [-c, c]} f(\theta).$$

We can show this more formally using an information theoretic derivation similar to that in Section 5. Omitting details, the argument is as follows. In the notation of Section 5, if a bias is chosen independently of the parameters $v \in \mathcal{V}$ of the risk $R_v$, then there is a bounded amount of mutual information that can be communicated to any optimization procedure. Consequently, Fano's inequality (20) guarantees that the estimation accuracy of any procedure must be bounded away from zero.
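
The one-dimensional instance above can be simulated directly. The following Python sketch is an illustration only: it assumes the objective $f(\theta) = b\theta/2$ on $[-c, c]$, uses plain projected gradient descent as the optimization procedure, and the function name `projected_gradient_descent` is hypothetical.

```python
def projected_gradient_descent(grad_oracle, c, theta0=0.0, step=0.1, iters=200):
    """Run projected gradient descent on the interval [-c, c]."""
    theta = theta0
    for _ in range(iters):
        theta = theta - step * grad_oracle(theta)
        theta = min(c, max(-c, theta))  # Euclidean projection onto [-c, c]
    return theta


b, c = 1.0, 2.0


def f(theta):
    """Assumed objective f(theta) = b * theta / 2."""
    return b * theta / 2.0


# The unbiased oracle returns f'(theta) = b / 2; the biased one returns -b / 2.
theta_true = projected_gradient_descent(lambda theta: b / 2.0, c)
theta_bias = projected_gradient_descent(lambda theta: -b / 2.0, c)

# Error suffered by the procedure that only sees biased gradients.
error = f(theta_bias) - f(theta_true)
```

Under these assumptions the unbiased run ends at $-c$ and the biased run at $+c$, so `error` equals the full range $bc$ of $f$ over the interval.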

B Calculation of the Mutual Information for Sampling Strategies

This appendix is devoted to the proofs of Lemma 5, Lemma 6, and Lemma 7. The proofs of the latter two require a minor lemma, which we present here before giving the proofs proper.

Lemma 10. Let $1 \ge p \ge \delta \ge 0$ and $p + \delta \le 1$. Then

$$(p + \delta) \log(p + \delta) + (p - \delta) \log(p - \delta) \ge 2p \log p.$$

Proof. Since the function $p \mapsto f(p) = p \log p$ is strictly convex over $[0, \infty)$, we may apply convexity. Indeed, $p = \frac{1}{2}(p + \delta) + \frac{1}{2}(p - \delta)$, so

$$p \log p = f\left(\tfrac{1}{2}(p + \delta) + \tfrac{1}{2}(p - \delta)\right) \le \tfrac{1}{2} f(p + \delta) + \tfrac{1}{2} f(p - \delta),$$

which is the desired result. □
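
As a quick numerical sanity check of Lemma 10 (not a substitute for the proof), the following Python sketch evaluates the gap between the two sides on a grid. The function names and the grid are illustrative choices, and the convention $0 \log 0 = 0$ is assumed.

```python
import math


def neg_entropy_gap(p, delta):
    """(p+d) log(p+d) + (p-d) log(p-d) - 2 p log p, with 0 log 0 := 0."""

    def xlogx(t):
        return 0.0 if t == 0.0 else t * math.log(t)

    return xlogx(p + delta) + xlogx(p - delta) - 2.0 * xlogx(p)


def check_lemma10(grid=50):
    """Check the inequality for p in (0, 1/2] and 0 <= delta <= p on a grid."""
    for i in range(1, grid + 1):
        p = i / (2 * grid)
        for j in range(0, i + 1):
            delta = j / (2 * grid)  # delta <= p <= 1/2, so p + delta <= 1
            if neg_entropy_gap(p, delta) < -1e-12:
                return False
    return True
```

The gap vanishes at $\delta = 0$ and grows with $\delta$, as strict convexity of $p \mapsto p \log p$ predicts.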



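B.1 Proof of Lemma 5

It is clear that the subgradient set $\partial \ell(X_i; \theta)$ is independent of $\theta$, so we may use the inequality (29) to bound the mutual information of $V$ and a single sample $Z$. Define $M = M_\infty d / c$. Since the sampling scheme (23) is independent per-coordinate, we see immediately that if $Z_j$ denotes the $j$th coordinate of $Z$ then

$$I(Z; V) = H(Z) - H(Z \mid V) \le d \log(2) - \sum_{j=1}^d H(Z_j \mid V).$$

Since $V$ is uniformly chosen from one of $2d$ vectors, we additionally find that

$$I(Z; V) \le d \left[ \log 2 - \frac{1}{2d^2} \sum_{v \in \mathcal{V}} \sum_{j=1}^d H(Z_j \mid V = v) \right].$$

By the choice of our sampling scheme for $X$ and $Z$, we see that $H(Z \mid V = v)$ is identical for each $v \in \mathcal{V}$, and we have

$$Q(Z_j = M_\infty \mid V_j = v_j = 0) = \frac{1}{2}, \quad \text{and} \quad Q(Z_j = -M_\infty \mid V_j = v_j = 0) = \frac{1}{2}.$$

On the other hand, by our choice of sampling scheme, for the "on" index in $V$, we have

$$\begin{aligned}
Q(Z_j = M_\infty \mid V_j = v_j = 1) &= Q(Z_j = M_\infty \mid X_j = c/d) P(X_j = c/d \mid V_j = v_j = 1) \\
&\quad + Q(Z_j = M_\infty \mid X_j = -c/d) P(X_j = -c/d \mid V_j = v_j = 1) \\
&= \left(\frac{M + 1}{2M}\right) \left(\frac{1 + \delta}{2}\right) + \left(\frac{M - 1}{2M}\right) \left(\frac{1 - \delta}{2}\right) = \frac{1}{2} + \frac{\delta}{2M}.
\end{aligned}$$

Consequently, defining the Bernoulli entropy $h(p) = -p \log p - (1 - p) \log(1 - p)$, then

$$I(Z; V) \le d \left[ \log 2 - \frac{1}{2d} \left( (2d - 2) \log 2 + 2 h\left(\frac{1}{2} + \frac{\delta}{2M}\right) \right) \right].$$

The concavity of the function $p \mapsto \log(p)$ yields that $\log(1/2 + p) \le \log(1/2) + 2p$, so

$$I(Z; V) \le \log 2 + \left(\frac{1}{2} + \frac{\delta}{2M}\right) \left(-\log 2 + \frac{\delta}{M}\right) + \left(\frac{1}{2} - \frac{\delta}{2M}\right) \left(-\log 2 - \frac{\delta}{M}\right) = \frac{\delta^2}{M^2}.$$

Making the substitution $M = M_\infty d / c$ completes the proof.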
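
The final step of this proof reduces to the scalar inequality $\log 2 - h(1/2 + \delta/(2M)) \le \delta^2/M^2$. The following Python sketch checks it numerically; the function names are hypothetical and the grid of test values is an arbitrary choice.

```python
import math


def binary_entropy(p):
    """Bernoulli entropy h(p) in nats, with 0 log 0 := 0."""
    total = 0.0
    for t in (p, 1.0 - p):
        if t > 0.0:
            total -= t * math.log(t)
    return total


def single_sample_information(delta, M):
    """log 2 - h(1/2 + delta / (2 M)), the quantity bounded in the proof."""
    return math.log(2.0) - binary_entropy(0.5 + delta / (2.0 * M))


def bound_holds(delta, M):
    """Check the claimed upper bound delta^2 / M^2 (small numerical slack)."""
    return single_sample_information(delta, M) <= delta ** 2 / M ** 2 + 1e-12
```

For small $\delta/M$ the left-hand side behaves like $\delta^2/(2M^2)$, so the stated bound is loose by roughly a factor of two in that regime.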
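
B.2 Proof of Lemma 6

By using the inequality (29), a bound on the mutual information $I(Z; V \mid \theta)$ implies a bound on the joint information in the statement of the lemma, so we focus on bounding the mutual information of a single sample $Z$. In addition, it is no loss of generality to assume that $r = 1$.

Define $M = M_\infty d / c$ to be the multiple of the $\ell_\infty$-norm of the subgradients that we take, and let $Z_j$ denote the $j$th coordinate of $Z$. Using the coordinate-wise independence of the sampling, we have

$$I(Z; V \mid \theta) = H(Z \mid \theta) - H(Z \mid V, \theta) \le d \log(2) - \sum_{j=1}^d H(Z_j \mid V_j, \theta_j).$$

Now consider the distribution of $Z_j$ given $V_j$ and $\theta_j$. By symmetry, the distribution has identical entropy for any value of $V_j$, so we may fix $V = v$ and assume $V_j = 1$ without loss of generality. Then for $\theta_j \in (-1, 1)$, the $j$th component of the subgradient $\partial \ell(X; \theta)$ is $-X_j$, whence we see that

$$\begin{aligned}
Q(Z_j = M_\infty \mid V_j = 1, \theta_j) &= Q(Z_j = M_\infty \mid X_j = 1, \theta_j) P(X_j = 1 \mid V_j = 1) + Q(Z_j = M_\infty \mid X_j = -1, \theta_j) P(X_j = -1 \mid V_j = 1) \\
&= \left(\frac{M - 1}{2M}\right) \left(\frac{1 + \delta}{2}\right) + \left(\frac{M + 1}{2M}\right) \left(\frac{1 - \delta}{2}\right) = \frac{2M - 2\delta}{4M} = \frac{1}{2} - \frac{\delta}{2M}.
\end{aligned}$$

Similarly, $Q(Z_j = -M_\infty \mid V_j = 1, \theta_j) = \frac{1}{2} + \frac{\delta}{2M}$. If $\theta_j \ge 1$, then we have that the subgradient $\partial |\theta_j - X_j| = 1$ with probability 1, and thus

$$Q(Z_j = M_\infty \mid V_j = 1, \theta_j) = \frac{M + 1}{2M} \cdot \frac{1 + \delta}{2} + \frac{M + 1}{2M} \cdot \frac{1 - \delta}{2} = \frac{M + 1}{2M},$$

which increases the entropy $H(Z_j \mid V_j, \theta_j)$ by Lemma 10. Thus we see that $\theta_j \in (-1, 1)$, yielding the Bernoulli marginal $(\frac{1}{2} + \frac{\delta}{2M}, \frac{1}{2} - \frac{\delta}{2M})$ on $Z_j \mid V_j$, has the smallest entropy $H(Z_j \mid V_j, \theta_j)$. Summarizing, we have

$$I(Z; V \mid \theta) \le d \log(2) + d \left[ \left(\frac{1}{2} + \frac{\delta}{2M}\right) \log\left(\frac{1}{2} + \frac{\delta}{2M}\right) + \left(\frac{1}{2} - \frac{\delta}{2M}\right) \log\left(\frac{1}{2} - \frac{\delta}{2M}\right) \right].$$

As in the proof of Lemma 5, we use the concavity of $\log$ to see that

$$\begin{aligned}
I(Z; V \mid \theta) &\le d \log(2) + d \left[ \left(\frac{1}{2} + \frac{\delta}{2M}\right) \left(-\log(2) + \frac{\delta}{M}\right) + \left(\frac{1}{2} - \frac{\delta}{2M}\right) \left(-\log(2) - \frac{\delta}{M}\right) \right] \\
&= d \left[ \left(\frac{1}{2} + \frac{\delta}{2M}\right) \frac{\delta}{M} - \left(\frac{1}{2} - \frac{\delta}{2M}\right) \frac{\delta}{M} \right] = \frac{d \delta^2}{M^2}.
\end{aligned}$$

Applying the bound (29) and replacing $M = M_\infty d / c$ completes the proof.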
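
B.3 Proof of Lemma 7

Letting $Z$ denote a single subgradient sample using the conditional distribution $Q$ specified by Proposition 2, we first prove that

$$I(Z; V \mid \theta) \le \delta^2 \Delta(\gamma)^2 \quad \text{for any } \theta \in \mathbb{R}^d. \tag{34}$$

Recall the SVM risk (26) defined using the individual hinge losses (24): by construction, whenever $X = e_i$, then the loss is equal to $c [r - \theta_i]_+$. We have

$$\partial \ell(e_i, \theta) = c \cdot \begin{cases} 0 & \text{if } \theta_i \ge r \\ -e_i & \text{otherwise} \end{cases} \quad \text{and} \quad \partial \ell(-e_i, \theta) = c \cdot \begin{cases} 0 & \text{if } \theta_i \le -r \\ e_i & \text{otherwise.} \end{cases}$$

For the remainder of this proof, we use the shorthand

$$D_\gamma := e^\gamma + e^{-\gamma} + 2(d - 1)$$

for the denominator in many of our expressions. By the construction in Proposition 2, we have

$$Q(Z = M_1 e_i \mid X = e_i, \theta) = \begin{cases} \frac{e^{-\gamma}}{D_\gamma} & \text{if } \theta_i < r \\ \frac{1}{2d} & \text{if } \theta_i \ge r, \end{cases} \tag{35}$$

and similarly we have for $j \ne i$ that

$$Q(Z = M_1 e_j \mid X = e_i, \theta) = \begin{cases} \frac{1}{D_\gamma} & \text{if } \theta_i < r \\ \frac{1}{2d} & \text{if } \theta_i \ge r. \end{cases}$$

For $X = -e_i$, we have the conditional distribution parallel to (35):

$$Q(Z = M_1 e_i \mid X = -e_i, \theta) = \begin{cases} \frac{e^{\gamma}}{D_\gamma} & \text{if } \theta_i > -r \\ \frac{1}{2d} & \text{if } \theta_i \le -r. \end{cases} \tag{36}$$

For any given $\theta$, we have that

$$I(Z; V \mid \theta) = H(Z \mid \theta) - H(Z \mid V, \theta) \le \log(2d) - \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} H(Z \mid \theta, V = v) \tag{37}$$

since the choice of $V$ is uniform and $Z$ takes on at most $2d$ values. We thus use the conditional distributions (35) and (36) to compute the entropy $H(Z \mid \theta, V)$ (specifically, the minimal such entropy across all values of $\theta$). To do this, we compute the marginal distribution $Q(z \mid v)$, arguing that $H(Z \mid \theta, V)$ is minimal for $\theta \in \mathrm{int}[-r, r]^d$. When $\theta_j \in (-r, r)$ for all $j$, we have

$$Q(Z = M_1 e_i \mid V = v, \theta) = \sum_{j=1}^d Q(Z = M_1 e_i \mid X = e_j, \theta) P(X = e_j \mid V = v) + \sum_{j=1}^d Q(Z = M_1 e_i \mid X = -e_j, \theta) P(X = -e_j \mid V = v).$$

When $v_i = 1$, we thus have that

$$\begin{aligned}
Q(Z = M_1 e_i \mid V = v, \theta) &= \frac{1 + \delta}{2d} \frac{e^{-\gamma}}{D_\gamma} + \frac{1 - \delta}{2d} \frac{e^{\gamma}}{D_\gamma} + \sum_{j \ne i} \frac{1}{D_\gamma} \left( \frac{1 + \delta v_j}{2d} + \frac{1 - \delta v_j}{2d} \right) \\
&= \frac{e^\gamma + e^{-\gamma} + \delta (e^{-\gamma} - e^{\gamma})}{2d D_\gamma} + \frac{d - 1}{d D_\gamma} = \frac{1}{2d} + \frac{\delta (e^{-\gamma} - e^{\gamma})}{2d D_\gamma},
\end{aligned} \tag{38a}$$

and under the same condition,

$$Q(Z = -M_1 e_i \mid V = v, \theta) = \frac{1}{2d} + \frac{\delta (e^{\gamma} - e^{-\gamma})}{2d D_\gamma}. \tag{38b}$$

If for any (possibly multiple) indices $j$ we have $\theta_j \notin (-r, r)$, then via a bit of algebra and the conditional distributions (35) and (36), we see that there exists an $\epsilon \in (0, 1)$ such that

$$Q(Z = M_1 e_i \mid V = v, \theta) = \frac{\epsilon}{2d} + (1 - \epsilon) \left( \frac{1}{2d} + \frac{\delta (e^{-\gamma} - e^{\gamma})}{2d D_\gamma} \right).$$

Lemma 10 then implies that if $\theta \in \mathrm{int}[-r, r]^d$ while $\theta' \notin \mathrm{int}[-r, r]^d$, then

$$H(Z \mid \theta, V = v) \le H(Z \mid \theta', V = v).$$

Since we seek an upper bound on the mutual information, we may thus assume without loss of generality that $\theta \in \mathrm{int}[-r, r]^d$.

Now we compute the entropy $H(Z \mid \theta, v)$ using the marginal conditional distributions (38a) and (38b), which describe $Z \mid V$ when $\theta \in \mathrm{int}[-r, r]^d$. Indeed, recall the definition in the statement of the lemma of the difference $\Delta(\gamma)$. For $z \in \{\pm M_1 e_j\}_{j=1}^d$, define the relation $z \sim v$ to mean that if $z = M_1 e_i$, then $v_i = 1$ and if $z = -M_1 e_i$ then $v_i = -1$. We then see that the entropy is

$$\begin{aligned}
H(Z \mid \theta, V = v) &= -\sum_{z \sim v} Q(z \mid v, \theta) \log Q(z \mid v, \theta) - \sum_{z \not\sim v} Q(z \mid v, \theta) \log Q(z \mid v, \theta) \\
&= -d \left( \frac{1}{2d} - \frac{\delta \Delta(\gamma)}{2d} \right) \log\left( \frac{1}{2d} - \frac{\delta \Delta(\gamma)}{2d} \right) - d \left( \frac{1}{2d} + \frac{\delta \Delta(\gamma)}{2d} \right) \log\left( \frac{1}{2d} + \frac{\delta \Delta(\gamma)}{2d} \right).
\end{aligned}$$

As in the proofs of Lemmas 5 and 6, we use the concavity of $\log(\cdot)$ to see that

$$\begin{aligned}
-H(Z \mid \theta, V = v) &= \left( \frac{1}{2} + \frac{\delta \Delta(\gamma)}{2} \right) \log\left( \frac{1}{2d} + \frac{\delta \Delta(\gamma)}{2d} \right) + \left( \frac{1}{2} - \frac{\delta \Delta(\gamma)}{2} \right) \log\left( \frac{1}{2d} - \frac{\delta \Delta(\gamma)}{2d} \right) \\
&\le \left( \frac{1}{2} + \frac{\delta \Delta(\gamma)}{2} \right) \left( \log\frac{1}{2d} + \delta \Delta(\gamma) \right) + \left( \frac{1}{2} - \frac{\delta \Delta(\gamma)}{2} \right) \left( \log\frac{1}{2d} - \delta \Delta(\gamma) \right) \\
&= -\log(2d) + \delta^2 \Delta(\gamma)^2.
\end{aligned}$$

Invoking the earlier bound (37) and adding $\log(2d)$ to the above expression completes the proof of the claim (34).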
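
A useful sanity check on the constants of Lemma 7: assuming $\gamma$ is the logarithm of the positive root of $(M-1)u^2 - (2d-2)u - (M+1) = 0$ and $\Delta(\gamma) = (e^\gamma - e^{-\gamma})/(e^\gamma + e^{-\gamma} + 2(d-1))$, a short calculation gives $\Delta(\gamma) = 1/M$. The Python sketch below verifies this numerically under those assumptions; the function name is hypothetical.

```python
import math


def gamma_and_delta(d, M):
    """Return (gamma, Delta(gamma)) for dimension d and multiplier M > 1."""
    if M <= 1.0:
        raise ValueError("the formula for gamma requires M > 1")
    a = 2 * d - 2
    # u = e^gamma is the positive root of (M-1) u^2 - (2d-2) u - (M+1) = 0.
    u = (a + math.sqrt(a * a + 4.0 * (M * M - 1.0))) / (2.0 * (M - 1.0))
    denom = u + 1.0 / u + 2.0 * (d - 1)  # shorthand for the common denominator
    return math.log(u), (u - 1.0 / u) / denom
```

Under this reading the bound of Lemma 7 becomes $n\delta^2/M^2 = n\delta^2c^2/M_1^2$, which has the same shape as the bounds in Lemmas 5 and 6.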



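B.4 Proof of Lemma 8

Let $Z_j$ denote the $j$th coordinate of $Z$. We first argue that conditional on $V$, the random variable $Z$ has independent coordinates. Indeed, let $q_+ = q(z \mid x)$ for $z$ such that $\langle z, x \rangle > k$ and $q_- = e^{-\alpha} q_+$. Without loss of generality, we may take $V = e_1$, the first basis vector, and hence

$$\begin{aligned}
Q(Z = z \mid V = e_1) &= \sum_{x \in \{-1, 1\}^d} Q(Z = z \mid X = x) P(X = x \mid V = e_1) = \frac{1}{2^{d-1}} \sum_{x \in \{-1, 1\}^d} Q(Z = z \mid X = x) \frac{1 + x_1 \delta}{2} \\
&= \frac{1}{2^{d-1}} \left[ q_+ \sum_{x : \langle z, x \rangle > k} \frac{1 + x_1 \delta}{2} + q_- \sum_{x : \langle z, x \rangle \le k} \frac{1 + x_1 \delta}{2} \right].
\end{aligned} \tag{39}$$

Now, if $z_1 = 1$, then

$$\sum_{x : \langle x, z \rangle > k} \frac{1 + x_1 \delta}{2} = \sum_{x : \langle x, z \rangle > k, \, x_1 = 1} \frac{1 + \delta}{2} + \sum_{x : \langle x, z \rangle > k, \, x_1 = -1} \frac{1 - \delta}{2} = \frac{1 + \delta}{2} C_{d-1}(k - 1) + \frac{1 - \delta}{2} C_{d-1}(k + 1)$$

and similarly

$$\sum_{x : \langle x, z \rangle \le k} \frac{1 + x_1 \delta}{2} = \frac{1 + \delta}{2} \left( 2^{d-1} - C_{d-1}(k - 1) \right) + \frac{1 - \delta}{2} \left( 2^{d-1} - C_{d-1}(k + 1) \right).$$

On the other hand, we find that if $z_1 = -1$, then similar equalities hold, but with the counters $C_{d-1}(k - 1)$ and $C_{d-1}(k + 1)$ flipped:

$$\begin{aligned}
\sum_{x : \langle x, z \rangle > k} \frac{1 + x_1 \delta}{2} &= \frac{1 + \delta}{2} C_{d-1}(k + 1) + \frac{1 - \delta}{2} C_{d-1}(k - 1), \\
\sum_{x : \langle x, z \rangle \le k} \frac{1 + x_1 \delta}{2} &= \frac{1 + \delta}{2} \left( 2^{d-1} - C_{d-1}(k + 1) \right) + \frac{1 - \delta}{2} \left( 2^{d-1} - C_{d-1}(k - 1) \right).
\end{aligned}$$

In particular, we find that so long as the first coordinate $z_1 = z_1'$ remains constant, then $Q(Z = z \mid V = e_1) = Q(Z = z' \mid V = e_1)$, and that we thus have $Z_2, \ldots, Z_d$ are distributed uniformly at random in $\{-1, 1\}^{d-1}$.

We now determine $q_+$ and $q_-$ and compute the marginal value $Q(Z_1 = 1 \mid V = e_1)$. For the first, we note that

$$C_d(k) q_+ + (2^d - C_d(k)) q_- = 1, \quad \text{or} \quad C_d(k) q_+ + e^{-\alpha} (2^d - C_d(k)) q_+ = 1,$$

which yields the expressions

$$q_+ = \frac{e^\alpha}{(e^\alpha - 1) C_d(k) + 2^d} \quad \text{and} \quad q_- = \frac{1}{(e^\alpha - 1) C_d(k) + 2^d}.$$

By the expression (39) and calculations following, we thus find that when $z_1 = 1$, we have

$$\begin{aligned}
q(z \mid e_1) &= \frac{1}{2^{d-1}} \left[ q_+ \left( \frac{1 + \delta}{2} C_{d-1}(k - 1) + \frac{1 - \delta}{2} C_{d-1}(k + 1) \right) + q_- \left( \frac{1 + \delta}{2} (2^{d-1} - C_{d-1}(k - 1)) + \frac{1 - \delta}{2} (2^{d-1} - C_{d-1}(k + 1)) \right) \right] \\
&= \frac{1}{2^{d-1}} \left[ 2^{d-1} q_- + \frac{1}{2} (q_+ - q_-) (C_{d-1}(k - 1) + C_{d-1}(k + 1)) + \frac{\delta}{2} (q_+ - q_-) (C_{d-1}(k - 1) - C_{d-1}(k + 1)) \right],
\end{aligned} \tag{40a}$$

and similarly when $z_1 = -1$ we have

$$q(z \mid e_1) = \frac{1}{2^{d-1}} \left[ 2^{d-1} q_- + \frac{1}{2} (q_+ - q_-) (C_{d-1}(k - 1) + C_{d-1}(k + 1)) - \frac{\delta}{2} (q_+ - q_-) (C_{d-1}(k - 1) - C_{d-1}(k + 1)) \right]. \tag{40b}$$

Now note that

$$C_{d-1}(k - 1) - C_{d-1}(k + 1) = \sum_{i=0}^{\lceil (d-k)/2 \rceil - 1} \binom{d-1}{i} - \sum_{i=0}^{\lceil (d-k)/2 \rceil - 2} \binom{d-1}{i} = \binom{d-1}{\lceil (d-k)/2 \rceil - 1}$$

and that the difference

$$q_+ - q_- = \frac{e^\alpha - 1}{(e^\alpha - 1) C_d(k) + 2^d}.$$

Recalling the definition of the constant $\Delta$, we thus find from the expansions (40a) and (40b), since they must sum to 1, that

$$Q(Z = z \mid V = e_1) = \frac{1}{2^{d-1}} \cdot \begin{cases} \frac{1}{2} + \frac{\Delta(\delta, \alpha, d, k)}{2} & \text{if } z_1 = 1 \\ \frac{1}{2} - \frac{\Delta(\delta, \alpha, d, k)}{2} & \text{if } z_1 = -1. \end{cases} \tag{41}$$

It is clear that similar statements hold in the other symmetric cases (i.e. if $V = -e_2$, then the probabilities depend on $Z_2 = -1$ or $1$).

It remains to use the marginalized representation (41) to compute the bound on the mutual information in the statement of the lemma. To that end, note that

$$\begin{aligned}
I(Z; V) &= H(Z) - H(Z \mid V) \le d \log 2 - \frac{1}{2d} \sum_{v \in \mathcal{V}} H(Z \mid V = v) \\
&= d \log 2 - (d - 1) \log 2 + \left( \frac{1}{2} + \frac{\Delta(\delta, \alpha, d, k)}{2} \right) \log\left( \frac{1}{2} + \frac{\Delta(\delta, \alpha, d, k)}{2} \right) + \left( \frac{1}{2} - \frac{\Delta(\delta, \alpha, d, k)}{2} \right) \log\left( \frac{1}{2} - \frac{\Delta(\delta, \alpha, d, k)}{2} \right) \\
&\le \log 2 + \left( \frac{1}{2} + \frac{\Delta(\delta, \alpha, d, k)}{2} \right) \left( \log\frac{1}{2} + \Delta(\delta, \alpha, d, k) \right) + \left( \frac{1}{2} - \frac{\Delta(\delta, \alpha, d, k)}{2} \right) \left( \log\frac{1}{2} - \Delta(\delta, \alpha, d, k) \right) \\
&= \Delta(\delta, \alpha, d, k)^2,
\end{aligned}$$

where the inequality follows from the concavity of $p \mapsto \log(p)$.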
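
For small dimensions the combinatorial quantities in this proof can be verified by brute force. The Python sketch below is an illustration under stated assumptions: it takes $P(X = x \mid V = e_1) = (1 + x_1\delta)/2^d$, restricts to integer $0 \le k < d$, and uses the hypothetical names `c_formula`, `c_brute`, `big_delta` and `marginal_z1_plus`.

```python
from itertools import product
from math import comb, exp


def c_formula(d, k):
    """C_d(k) as the partial binomial sum up to ceil((d - k) / 2) - 1."""
    upper = -((k - d) // 2) - 1  # integer form of ceil((d - k) / 2) - 1
    return sum(comb(d, i) for i in range(upper + 1))


def c_brute(d, k):
    """Count z in {-1, 1}^d with <z, x> > k, for the all-ones x."""
    return sum(1 for z in product((-1, 1), repeat=d) if sum(z) > k)


def big_delta(delta, alpha, d, k):
    """The constant Delta(delta, alpha, d, k), assuming integer 0 <= k < d."""
    top = comb(d - 1, -((k - d) // 2) - 1)
    return delta * (exp(alpha) - 1.0) * top / ((exp(alpha) - 1.0) * c_formula(d, k) + 2 ** d)


def marginal_z1_plus(delta, alpha, d, k):
    """Brute-force Q(Z_1 = 1 | V = e_1) by summing over all x and z."""
    cube = list(product((-1, 1), repeat=d))
    norm = (exp(alpha) - 1.0) * c_formula(d, k) + 2 ** d
    total = 0.0
    for x in cube:
        px = (1.0 + x[0] * delta) / 2 ** d  # assumed law of X given V = e_1
        for z in cube:
            if z[0] != 1:
                continue
            inner = sum(a * b for a, b in zip(z, x))
            total += px * (exp(alpha) if inner > k else 1.0) / norm
    return total
```

Summing (41) over the $2^{d-1}$ points with $z_1 = 1$ shows that `marginal_z1_plus` should agree with $1/2 + \Delta/2$.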



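B.5 Proof of Lemma 9

Recall that for any two probability distributions $P$, $Q$, Pinsker's inequality 0] asserts that the total variation norm is bounded as $\|P - Q\|_{\rm TV} \le \sqrt{D_{\rm kl}(P \| Q) / 2}$. Applying this inequality in our setting, we find that

$$\left\| Q_1^n - Q_{-1}^n \right\|_{\rm TV} \le \sqrt{\frac{1}{2} D_{\rm kl}\left(Q_1^n \| Q_{-1}^n\right)} = \sqrt{\frac{n}{2} D_{\rm kl}\left(Q_1 \| Q_{-1}\right)},$$

where we have exploited the product nature of $Q_v^n$. Now we note that by the concavity of the log, we have (via the first-order inequality) that $\log\frac{1 + \delta}{1 - \delta} \le 2\delta/(1 - \delta)$, so

$$\frac{1 + \delta}{2} \log\frac{1 + \delta}{1 - \delta} + \frac{1 - \delta}{2} \log\frac{1 - \delta}{1 + \delta} = \left( \frac{1 + \delta}{2} - \frac{1 - \delta}{2} \right) \log\frac{1 + \delta}{1 - \delta} = \delta \log\frac{1 + \delta}{1 - \delta} \le \frac{2\delta^2}{1 - \delta}.$$

Assuming that $\delta \le 1/3$, the final term is upper bounded by $3\delta^2$. But of course by definition of $Q_1$ and $Q_{-1}$, we have

$$D_{\rm kl}\left(Q_1 \| Q_{-1}\right) = \frac{1 + \delta}{2} \log\frac{1 + \delta}{1 - \delta} + \frac{1 - \delta}{2} \log\frac{1 - \delta}{1 + \delta} \le 3\delta^2,$$

which completes the proof.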
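
Both steps of this argument can be checked numerically for small $n$. The Python sketch below computes the exact total variation distance between the two $n$-fold product distributions by grouping sequences according to the number of $+1$ outcomes; it assumes $Q_{\pm 1}(Z = 1) = (1 \pm \delta)/2$, and the function names are hypothetical.

```python
import math


def kl_single(delta):
    """KL(Q_1 || Q_{-1}) = delta * log((1 + delta) / (1 - delta))."""
    return delta * math.log((1.0 + delta) / (1.0 - delta))


def tv_product(delta, n):
    """Exact sup_A |Q_1^n(A) - Q_{-1}^n(A)| for the n-fold products."""
    p = (1.0 + delta) / 2.0
    q = (1.0 - delta) / 2.0
    total = 0.0
    for k in range(n + 1):
        # All sequences with k outcomes equal to +1 share the same probabilities.
        a = math.comb(n, k) * p ** k * q ** (n - k)
        b = math.comb(n, k) * q ** k * p ** (n - k)
        total += abs(a - b)
    return total / 2.0


def lemma9_bound(delta, n):
    """The claimed upper bound delta * sqrt(3 n / 2)."""
    return delta * math.sqrt(1.5 * n)
```

The bound is only informative while $\delta\sqrt{3n/2} < 1$, which matches the choice $\delta \propto 1/\sqrt{n}$ made in the proof of Theorem 1.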



C Background on Conditional Probabilities 

In this appendix, we present some basic lemmas on conditional independence and regular conditional probabilities that will be useful in Appendix D.

We first recall the following classical data-processing inequality, which holds for essentially arbitrary random variables [21, Chapter 5]:

Lemma 11 (Data processing). Let $X \to Z \to Y$ be a Markov chain. Then $I(X; Y) \le I(X; Z)$, with equality if and only if $X$ is conditionally independent of $Z$ given $Y$.

This inequality, in conjunction with Carathéodory and Minkowski's finite-dimensional version of the Krein-Milman theorem [e.g. 12], allows us to argue any $Q$ minimizing $I(P, Q)$ must be supported on the extreme points of $D$. To make this point precise, however, we need to address certain measurability issues involved in the choice of the extreme points.
We begin with a precise definition of a regular conditional probability.

Definition 5. Let $(\Omega, \mathcal{F})$ and $(T, \sigma(T))$ be measurable spaces. A regular conditional probability, also known as a Markov kernel or transition probability, is a function $\nu : T \times \mathcal{F} \to [0, 1]$ such that

$t \mapsto \nu(t, A)$ is measurable for all $A \in \mathcal{F}$
$\nu(t, \cdot) : \mathcal{F} \to [0, 1]$ is a probability measure for all $t \in T$.

Any Markov chain has a transition probability; conversely, any set of consistent transition probabilities define a Markov chain (see, e.g., Chapter 5 of Kallenberg [25]).

Some difficulties with measurability arise in constructing the appropriate Markov chain for our setting. To deal with them, we use results from Choquet theory, which extend Krein-Milman theorems to integral representations [35]. We begin our proof by stating a measurable selection theorem [35, Theorem 11.4], though we restrict the theorem's statement to subsets of finite dimensional space.

Proposition 4. Let $D \subset \mathbb{R}^d$ be a compact convex set. For each $x \in D$, there exists a probability measure $\mu_x$ supported on $\mathrm{Ext}(D)$ such that $\int_D y \, d\mu_x(y) = x$. Moreover, the mapping $x \mapsto \mu_x$ can be taken to be measurable.

In the statement of this result, measurability is taken with respect to the $\sigma$-field generated by the topology of weak convergence. As a consequence of the proposition, however, it is clear that since for any continuous function $f$ the mapping $x \mapsto \int f \, d\mu_x$ is measurable, we have that for relatively open sets $A \subset C$ the mapping $x \mapsto \mu_x(A)$ is measurable, whence for any measurable set $A \subset C$ the mapping $x \mapsto \mu_x(A)$ is measurable. That is, we can define the Markov kernel $\nu : \mathbb{R}^d \times \sigma(C) \to [0, 1]$ according to the mapping specified by Proposition 4 (we take $\nu(x, \cdot) = \mu_x$) with the additional properties that

$$\int_D y \, \nu(x, dy) = x \quad \text{and} \quad \nu(x, D \setminus \mathrm{Ext}(D)) = 0 \quad \text{for all } x \in D.$$

In finite dimensions, a trivial extension of Proposition 4 allows us to drop the assumption that $D$ is convex. Indeed, we have that since $D$ is compact, then $\mathrm{Ext}(D) = \mathrm{Ext}(\mathrm{Conv}(D))$ [24, Chapter III.2].

Given this measure-theoretic background, we turn to a key lemma that we will need in Appendix D. In this lemma, we assume as usual that $C \subset D \subset \mathbb{R}^d$ are compact sets, and that $Q \in \mathcal{Q}(C, D)$ (recall the definition (10b)).

Lemma 12. Let $P$ be a distribution supported on $C$. If there exists a set $A \subset C$ with $P(A) > 0$ and a set $B \subset D \setminus \mathrm{Ext}(D)$ with $Q(B \mid X = x) > 0$ for $x \in A$, there exists a regular conditional probability distribution $Q' \in \mathcal{Q}(C, D)$ where $Q'(\cdot \mid x)$ has support contained in $\mathrm{Ext}(D)$ and

$$I(P, Q) > I(P, Q').$$

Paraphrasing the lemma slightly, we have that any conditional distribution $Q$ minimizing $I(P, Q)$ must (outside of a set of measure zero) be completely supported on the extreme points $\mathrm{Ext}(D)$.
Proof. For any $y \in D$, Proposition 4 guarantees that we can represent $y$ as the (regular conditional) measure $\nu(y, \cdot)$. Thus we can define a random variable $Z_y$ distributed according to $\nu(y, \cdot)$, whose existence we are guaranteed by standard constructions Q, Q with regular conditional probability. Then $\mathbb{E}[Z_y] = \int_D z \, \nu(y, dz) = y$, and moreover, we can define the measurable version of the conditional expectation $\mathbb{E}[Z_Y \mid Y]$ via

$$\mathbb{E}[Z_Y \mid Y] = \int_D z \, \nu(Y, dz) = Y$$

so we have the (almost sure) chain of equalities

$$\begin{aligned}
\mathbb{E}[Z_Y \mid X = x] &= \mathbb{E}\left[ \mathbb{E}[Z_Y \mid Y] \mid X = x \right] = \int_D \mathbb{E}[Z_Y \mid Y = y] \, dQ(y \mid X = x) \\
&= \int_D \int_D z \, \nu(y, dz) \, dQ(y \mid X = x) = \int_D y \, dQ(y \mid X = x) = x.
\end{aligned}$$

By construction, $X \to Y \to Z$ is a valid Markov chain, and since the sets $A$ and $B$ satisfy $P(A) > 0$ and $\int_A Q(B \mid X = x) \, dP(x) > 0$, we see that $I(X; Y) > I(X; Z)$ by Lemma 11. □



We turn to an analogue of Lemma [12] in the differentially private setting. 

Lemma 13. Let the conditions of Lemma{T^hold, and let P be a distribution supported on C. If 
there exists a set A <Z C with P{A) > and a set B <Z D\ Ext(L') with Q{B \ X = x) > for 
X £ A, there exists a regular conditional probability distribution Q' G Q{C,D) where Q'{- \ x) has 
support contained in Ext(Z)), satisfies 

I{P,Q)> I{P,Q'), 
and has no worse differential privacy than Q: 

Q'(S \X = x) Q(S\X = x) 
sup sup . I jr < sup sup -. 

S€aiD)x,x'(^C \ A — X ) Sea{D)x,x'£C WW \ ^ — ^ ) 

Proof Let $\nu : \mathbb{R}^d \times \sigma(C) \to [0, 1]$ be the Markov kernel defined in the proof of Lemma [12], and without loss of generality assume that $Q(\cdot \mid X = x)$ and $Q(\cdot \mid X = x')$ have density $q$ with respect to an underlying measure $\mu_{x,x'}$. Define the distribution

$$Q'(S \mid X = x) := \int_S \int_D \nu(y, dz)\,q(y \mid x)\,d\mu_{x,x'}(y).$$

By assumption, if $Q$ is $\alpha$-differentially private, then for $\mu_{x,x'}$-almost all $y \in D$, we have $q(y \mid x) \le e^{\alpha} q(y \mid x')$. We find that

$$Q'(S \mid X = x) = \int_S \int_D \nu(y, dz)\,q(y \mid x)\,d\mu_{x,x'}(y) \le \int_S \int_D \nu(y, dz)\,e^{\alpha} q(y \mid x')\,d\mu_{x,x'}(y) = e^{\alpha} Q'(S \mid X = x'),$$

so $Q'$ is at least as differentially private as $Q$. □
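The construction used in Lemmas 12 and 13 can be illustrated on a small finite example. The sketch below is only an illustration under assumed numbers: the three-point output alphabet, the two-input channel `Q`, the kernel `NU`, and all helper names are hypothetical and do not come from the text. It pushes a channel through a mean-preserving Markov kernel onto the extreme points of $D = [-1, 1]$ and lets one check that the conditional means are unchanged, the mutual information does not increase, and the worst-case likelihood ratio does not grow.

```python
import math

# Hypothetical finite example: outputs y in {-1, 0, 1} inside D = [-1, 1],
# whose extreme points are {-1, 1}. Rows of Q are Q(. | X = x) for two inputs.
Y_VALUES = [-1.0, 0.0, 1.0]
Z_VALUES = [-1.0, 1.0]
Q = [[0.2, 0.5, 0.3],
     [0.3, 0.5, 0.2]]
# Mean-preserving kernel nu(y, .): extreme points stay put, y = 0 splits evenly.
NU = [[1.0, 0.0],
      [0.5, 0.5],
      [0.0, 1.0]]
PRIOR = [0.5, 0.5]


def push_through_kernel(channel, kernel):
    """Return Q'(z | x) = sum_y nu(y, z) Q(y | x)."""
    n_out = len(kernel[0])
    return [[sum(row[y] * kernel[y][z] for y in range(len(row)))
             for z in range(n_out)] for row in channel]


def conditional_means(channel, values):
    """Mean of the output under each row of the channel."""
    return [sum(p * v for p, v in zip(row, values)) for row in channel]


def mutual_information(prior, channel):
    """I(X; output) in nats for a finite prior and channel."""
    n_out = len(channel[0])
    marginal = [sum(prior[x] * channel[x][s] for x in range(len(prior)))
                for s in range(n_out)]
    total = 0.0
    for x, row in enumerate(channel):
        for s, p in enumerate(row):
            if p > 0:
                total += prior[x] * p * math.log(p / marginal[s])
    return total


def max_privacy_ratio(channel):
    """Largest Q(s | x) / Q(s | x') over single outputs s.

    For a finite output alphabet the supremum over sets S is attained on a
    single point, since a ratio of sums is at most the largest termwise ratio.
    """
    worst = 1.0
    for row_a in channel:
        for row_b in channel:
            for pa, pb in zip(row_a, row_b):
                if pa > 0:
                    worst = max(worst, pa / pb if pb > 0 else math.inf)
    return worst


Q_PRIME = push_through_kernel(Q, NU)
```

Comparing `mutual_information(PRIOR, Q)` with `mutual_information(PRIOR, Q_PRIME)`, and `max_privacy_ratio(Q)` with `max_privacy_ratio(Q_PRIME)`, shows the two monotonicity properties on this toy channel.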



Finally, we will need the following standard maximum entropy result. Let $Z$ denote a discrete random variable and let $q(z \mid x)$ denote the conditional probability mass function of $Z \mid X = x$. Consider the finite dimensional entropy maximization problem

$$\underset{q}{\text{minimize}} \quad \sum_z q(z \mid x) \log q(z \mid x) \qquad (42)$$
$$\text{subject to} \quad \sum_z z\,q(z \mid x) = x, \quad \sum_z q(z \mid x) = 1, \quad q(z \mid x) \ge 0 \text{ for all } z.$$

We have the following lemma, which establishes the form of the solution to the problem (42). We include a proof for completeness.

Lemma 14. The p.m.f. $q(\cdot \mid x)$ solving problem (42) is given by

$$q(z \mid x) = \frac{\exp(-\mu^\top z)}{\sum_{z'} \exp(-\mu^\top z')} \qquad (43)$$

where $\mu \in \mathbb{R}^d$ is any vector chosen to satisfy the constraint $\sum_z z\,q(z \mid x) = x$. Such a $\mu \in \mathbb{R}^d$ exists.

Proof We may write the Lagrangian with dual variables $\mu \in \mathbb{R}^d$, $\lambda(z) \ge 0$, and $\theta \in \mathbb{R}$,

$$\mathcal{L}(q, \mu, \lambda, \theta) = \sum_z q(z \mid x) \log q(z \mid x) + \mu^\top\Big(\sum_z z\,q(z \mid x) - x\Big) + \theta\Big(\sum_z q(z \mid x) - 1\Big) - \sum_z \lambda(z) q(z \mid x).$$

Since the problem (42) has convex cost, linear constraints, and non-empty domain, strong duality obtains 0, Chapter 5], and the KKT conditions hold for the problem. Thus, minimizing $q$ out of $\mathcal{L}$ to find the dual, we take derivatives with respect to the $m$ variables $q(z \mid x)$ for $z = (1 + \alpha)u_i$ and find the optimal conditional p.m.f. $q$ must satisfy

$$\log q(z \mid x) + 1 + \mu^\top z + \theta - \lambda(z) = 0, \quad \text{or} \quad q(z \mid x) = \exp(\lambda(z) - 1 - \theta)\exp(-\mu^\top z).$$

In particular, we see that since $q(z \mid x) > 0$, we must have $\lambda(z) = 0$ by complementarity, and (satisfying the summability constraint $\sum_z q(z \mid x) = 1$) we see that

$$q(z \mid x) = \frac{\exp(-\mu^\top z)}{\sum_{z'} \exp(-\mu^\top z')},$$

where $\mu \in \mathbb{R}^d$ is any vector chosen to satisfy the constraint $\sum_z z\,q(z \mid x) = x$. The existence of such a $\mu$ is guaranteed by the attainment of the KKT conditions. □
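As a numerical illustration of Lemma 14, the following sketch computes a p.m.f. of the exponential form (43) on a finite scalar support by bisection on the dual variable. It is not part of the proof: the function names, the restriction to one dimension, and the bracketing interval for $\mu$ are assumptions made for brevity. It relies on the fact that the mean of the p.m.f. (43) is nonincreasing in $\mu$, because its derivative is minus the variance.

```python
import math


def gibbs_pmf(support, mu):
    """p.m.f. proportional to exp(-mu * z) on a finite list of reals."""
    logits = [-mu * z for z in support]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def max_entropy_pmf(support, target_mean, iters=200):
    """Find mu so that the p.m.f. of form (43) has the requested mean.

    Assumes target_mean lies strictly between min(support) and max(support)
    and that the solution falls inside the (hypothetical) bracket below.
    """
    scale = max(abs(z) for z in support)
    lo, hi = -50.0 / scale, 50.0 / scale
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        pmf = gibbs_pmf(support, mid)
        mean = sum(z * p for z, p in zip(support, pmf))
        if mean > target_mean:
            lo = mid  # mean too large: increase mu to push mass downward
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    return mu, gibbs_pmf(support, mu)
```

For the two-point support $\{-M, M\}$ with target mean $x$, the returned p.m.f. places mass $\tfrac12 + \tfrac{x}{2M}$ on $M$, which is the only distribution meeting the constraints in that case.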



D Proofs of Minimax Mutual Information Characterizations 

In this section, we provide the proofs of the results stated in Section [U all of which follow a 
broadly similar outline. We make use of Lemma [T2] to guarantee that any conditional distribution 
Q minimizing the mutual information /(P, Q) must be supported on the extreme points of the set 
D. This allows us to reduce computing maximal entropies and minimal mutual information values 
to finite dimensional convex programs, whose optimality we can check using results from convex 
analysis and optimization. 

D.1 Proof of Theorem [4]



We begin by considering $\sup_P I(P, Q^*)$, where $Q^*$ is defined as in the statement of the theorem. Since the support of $Q^*$ is finite (there are $m$ extreme points of $D$), we have

$$I(P, Q^*) = I(X; Z) = H(Z) - H(Z \mid X) \le \log(m) - H(Z \mid X) = \log(m) - \int H(Z \mid X = x)\,dP(x).$$

Now, for any distribution $P$ on the set $C$ and for any $x \in \operatorname{supp} P$, we can write $x$ as $x = \sum_i \lambda_i(x) u_i$, where $u_i$ are the extreme points of $C$, and where $\lambda_i(x) \ge 0$ and $\sum_i \lambda_i(x) = 1$ (using the Krein-Milman theorem). Define the individual probability mass functions $q^i$ to be the maximum entropy p.m.f. (43) for each of the extreme points $u_i$. Then we can define the conditional probability mass function by

$$q(\cdot \mid x) = \sum_i \lambda_i(x) q^i(\cdot).$$

(Without loss of generality, we may assume that the $\lambda_i$ are continuous, since the set of extreme points is finite, and thus $q(\cdot \mid x)$ can be viewed as a regular conditional probability. We can make this formal using the techniques in the proof of Lemma [12].) Denoting $H(q(\cdot \mid x)) := H(Z \mid X = x)$, we can use the convexity of the negative entropy to see that

$$I(P, Q^*) \le \log(m) - \int \sum_i \lambda_i(x) H(q^i(\cdot))\,dP(x). \qquad (44)$$

By symmetry, the entropy $H(q^i(\cdot)) = H(Q^*(\cdot \mid X = u_i))$ is a constant determined by the maximum entropy distribution (43), and thus

$$I(P, Q^*) \le \log(m) - H(Q^*(\cdot \mid X = u_i)). \qquad (45)$$

Equality in the upper bound (45) is attained by taking $P^*$ to be the uniform distribution on the extreme points $\{u_i\}$ of $C$.

It remains to establish an identical lower bound for $I(P^*, Q)$ over all conditional distributions $Q$ satisfying the constraints of the theorem statement. We know from Lemma [12] that $Q$ must be supported on $(1 + \kappa)u_i$ for $i = 1, \ldots, m$. Denoting by $q(z \mid x)$ the p.m.f. of $Q$ conditional on $x$ (for $x$ in the finite set of extreme points of $C$ that make up the support $\operatorname{supp} P^*$), we can write minimizing the mutual information as the parametric convex optimization problem

$$\underset{q}{\text{minimize}} \quad -\sum_z \Big(\sum_x q(z \mid x)p(x)\Big) \log\Big(\sum_x q(z \mid x)p(x)\Big) + \sum_x p(x) \sum_z q(z \mid x) \log q(z \mid x) \qquad (46)$$
$$\text{subject to} \quad \sum_z q(z \mid x) = 1 \text{ for all } x, \quad \sum_z z\,q(z \mid x) = x \text{ for all } x, \quad q(z \mid x) \ge 0 \text{ for all } x, z.$$

In the problem (46), the sums over $x$ and $z$ are over the extreme points of $C$ and $D$, respectively, and $p$ is the uniform distribution with $p(x) = 1/m$. The objective is $H(Z \mid X) - H(Z)$ with the sign reversed, that is, the mutual information $I(X; Z)$. Mutual information is convex in the conditional distribution $q$; moreover, it is strictly convex except when $q(z \mid x) = \sum_{x'} q(z \mid x')p(x')$ for all $x, z$. (This can be seen by an inspection of the proof of Theorem 2.7.4 by Cover and Thomas 0].) In our case, since $Q^*$ does not satisfy this equality, the uniqueness of $Q^*$ as the minimizer of $I(P^*, Q)$ will follow if we show that $Q^*$ is a minimizer at all.

We proceed to solve the problem (46). Writing $I(p, q)$ as a shorthand for the mutual information, we introduce Lagrange multipliers $\theta(x) \in \mathbb{R}$ for the normalization constraints, $\mu(x) \in \mathbb{R}^d$ for the conditional expectation constraints, and $\lambda(x, z) \ge 0$ for the nonnegativity constraints. This yields the Lagrangian

$$\mathcal{L}(q, \mu, \lambda, \theta) = I(p, q) - \sum_{x, z} \lambda(x, z) q(z \mid x) + \sum_x \mu(x)^\top \Big(\sum_z z\,q(z \mid x) - x\Big) + \sum_x \theta(x) \Big(\sum_z q(z \mid x) - 1\Big).$$

If we can satisfy the Karush-Kuhn-Tucker (KKT) conditions (see, e.g., [6]) for optimality of the problem (46), we will be done. Taking derivatives with respect to $q(z \mid x)$, we see

$$\frac{\partial}{\partial q(z \mid x)} \mathcal{L}(q, \mu, \lambda, \theta) = p(x)\big[\log q(z \mid x) + 1\big] - p(x) \log\Big(\sum_{x'} q(z \mid x')p(x')\Big) - q(z) \cdot \frac{1}{q(z)} p(x) - \lambda(x, z) + \theta(x) + \mu(x)^\top z$$
$$= p(x) \log q(z \mid x) - p(x) \log\Big(\sum_{x'} q(z \mid x')p(x')\Big) - \lambda(x, z) + \theta(x) + \mu(x)^\top z,$$

where we set $q(z) = \sum_{x'} q(z \mid x')p(x')$ for shorthand. Now, we use symmetry to note that since we have chosen $q$ to be the maximum entropy distribution (43) for each $x$ in the extreme points $\{u_i\}$ of $C$, the marginal $q(z) = \sum_{x'} q(z \mid x')p(x') = 1/m$ is uniform by the symmetry of the set $D$ and since $p$ is uniform. In addition, since $q(z \mid x) > 0$ strictly, we have $\lambda(x, z) = 0$ by complementarity. Thus, at $q$ chosen to be the maximum entropy distribution, we can rewrite the derivative of the Lagrangian as

$$\frac{\partial}{\partial q(z \mid x)} \mathcal{L}(q, \mu, \lambda, \theta) = \frac{1}{m} \log q(z \mid x) - \frac{1}{m} \log \frac{1}{m} + \theta(x) + \mu(x)^\top z.$$

Recalling the definition (43) of $q(z \mid x)$, and denoting the maximum entropy parameters $\mu$ there by $\mu^*(x)$, we have

$$\frac{\partial}{\partial q(z \mid x)} \mathcal{L}(q, \mu, \lambda, \theta) = -\frac{1}{m}\mu^*(x)^\top z - \frac{1}{m} \log\Big(\sum_{z'} \exp(-\mu^*(x)^\top z')\Big) - \frac{1}{m} \log \frac{1}{m} + \theta(x) + \mu(x)^\top z.$$

Now, by inspection we may set

$$\theta(x) = \frac{1}{m} \log \frac{1}{m} + \frac{1}{m} \log\Big(\sum_{z'} \exp(-\mu^*(x)^\top z')\Big) \quad \text{and} \quad \mu(x) = \frac{1}{m}\mu^*(x),$$

and we satisfy the KKT conditions for the mutual information minimization problem (46).

Summarizing, the conditional distribution $Q^*$ specified in the statement of the theorem as the maximum entropy distribution (43) satisfies

$$\inf_Q I(P^*, Q) \ge I(P^*, Q^*),$$

which, when combined with the first part of the proof, gives the saddle point inequality

$$\sup_P I(P, Q^*) \le \log(m) - H(q(\cdot \mid X = u_i)) = I(P^*, Q^*) \le \inf_Q I(P^*, Q),$$

as claimed.
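The upper half of the saddle point inequality can be checked numerically in the simplest symmetric case. The sketch below is an illustration only, under the assumption $C = [-1, 1]$ and $D = [-M, M]$ (so $m = 2$), with $X$ restricted to the extreme points $\{-1, 1\}$; the function names are hypothetical. In this case the only distribution on $\{-M, M\}$ with mean $\pm 1$ puts mass $a = \tfrac12 + \tfrac{1}{2M}$ on the point with the same sign as $x$, so $I(P, Q^*) = h(P(Z = M)) - h(a)$, where $h$ is the binary entropy.

```python
import math


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def mutual_information(prob_x_plus, big_m):
    """I(X; Z) for X in {-1, +1} and Z in {-M, +M} under the channel Q*.

    Q*(Z = M | X = +1) = a and Q*(Z = M | X = -1) = 1 - a, a = 1/2 + 1/(2M).
    """
    a = 0.5 + 1.0 / (2.0 * big_m)
    prob_z_plus = prob_x_plus * a + (1.0 - prob_x_plus) * (1.0 - a)
    return binary_entropy(prob_z_plus) - binary_entropy(a)


def saddle_value(big_m):
    """log(m) - H(Q*(. | X = u_i)) with m = 2, the right side of (45)."""
    a = 0.5 + 1.0 / (2.0 * big_m)
    return math.log(2.0) - binary_entropy(a)
```

Sweeping `prob_x_plus` over $[0, 1]$ shows that `mutual_information` never exceeds `saddle_value` and meets it at the uniform prior, matching the role of $P^*$ in the proof.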

Remarks: In the proof of the theorem, we have defined $Q^*(\cdot \mid x)$ as a conditional distribution only for $x \in \mathrm{Ext}(C)$, the extreme points of $C$. This can easily be remedied: take $Q^*(\cdot \mid x)$ to be the distribution maximizing the entropy $H(Z \mid X = x)$ for each $x \in C$ under the constraint that the support of $Z$ be contained in $\mathrm{Ext}(D)$. This is equivalent to choosing, for each $x \in C$, $Z = z_i$ for $z_i \in \mathrm{Ext}(D)$, $i = 1, \ldots, m$, with probability $q_i$, where $q \in \mathbb{R}^m$ solves the entropy maximization problem

$$\text{maximize} \quad -\sum_i q_i \log q_i \quad \text{subject to} \quad \sum_i z_i q_i = x, \quad \sum_i q_i = 1, \quad q_i \ge 0.$$

Inspecting the proof of Theorem [4] (see the bound (44)) shows that this choice can only decrease the mutual information $I(X; Z)$. Additionally, the strong convexity of the negative entropy over the simplex guarantees that the solutions to this optimization problem are continuous in $x$ (see Chapter X of Hiriart-Urruty and Lemaréchal), so this distribution $q(\cdot \mid x)$ defines a measurable random variable as desired.

Additionally, though Theorem [4] assumes that the sets $C$ and $D$ satisfy $D = (1 + \kappa)C$ for some $\kappa > 0$, inspection of the proof yields a somewhat stronger result. Assume the distribution $Q$ maximizing the entropy $H(Z \mid X = x)$ satisfies $H(Q(\cdot \mid X = x)) = H(Q(\cdot \mid X = x'))$ for each pair of extreme points $x, x'$ of $C$ and additionally satisfies that for each extreme point $z$ of $D$ the sum $\sum_x Q(Z = z \mid X = x)$ is a constant (the sum is over extreme points $x$ of $C$). Then the upper bound (45) is attained with equality, and a similar calculation yields that $Q$ solves the mutual information problem (46). Thus, as long as $C$ and $D$ are suitably jointly symmetric, $Z$ should be chosen to maximize the entropy $H(Z \mid X = x)$ for each $x \in C$.

D.2 Proof of Proposition [1] 

Using Theorem [4] (and the remarks immediately following its proof), we can focus on maximizing the entropy of the random variable $Z$ conditional on $X = x$ for each fixed $x \in [-1, 1]^d$. Let $Z_i$ denote the $i$th coordinate of the random vector $Z$; we take the conditional distribution of $Z_i$ to be independent of $Z_j$ for $j \ne i$ and let $Z_i$ be distributed as

$$Z_i = \begin{cases} M & \text{w.p. } \frac{1}{2} + \frac{x_i}{2M} \\ -M & \text{w.p. } \frac{1}{2} - \frac{x_i}{2M}. \end{cases} \qquad (47)$$

Let us now verify that the distribution (47) maximizes the entropy $H(Z \mid X = x)$. Indeed, ignoring the conditioning we write the entropy maximization problem

$$\underset{q}{\text{minimize}} \quad -H(q) \quad \text{subject to} \quad \sum_z q(z) = 1, \quad q(z) \ge 0, \quad \sum_z z\,q(z) = x, \qquad (48)$$

where all sums are taken over $z \in \mathrm{Ext}([-M, M]^d) = \{-M, M\}^d$. Introducing Lagrange multipliers $\mu \in \mathbb{R}^d$, $\lambda(z) \ge 0$, and $\theta \in \mathbb{R}$, we find that problem (48) has the Lagrangian

$$\mathcal{L}(q, \mu, \lambda, \theta) = -H(q) - \sum_z \lambda(z) q(z) + \mu^\top\Big(\sum_z z\,q(z) - x\Big) + \theta\Big(\sum_z q(z) - 1\Big).$$

To find the infimum of the Lagrangian with respect to $q$, we take derivatives (since we make the identification $q \in \mathbb{R}^{2^d}$). We see that

$$\frac{\partial}{\partial q(z)} \mathcal{L}(q, \mu, \lambda, \theta) = \log(q(z)) + 1 - \lambda(z) + \theta + \mu^\top z.$$
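
With the definition (47) of the probability mass function $q$ (that $Z_i$ are independent Bernoulli random variables with parameters $\frac{1}{2} + \frac{x_i}{2M}$), the coordinate conditional distributions are

$$q(z_i \mid x_i) = \Big(\frac{1}{2} + \frac{x_i}{2M}\Big)^{\frac{1}{2} + \frac{z_i}{2M}} \Big(\frac{1}{2} - \frac{x_i}{2M}\Big)^{\frac{1}{2} - \frac{z_i}{2M}}.$$

Since Theorem [4] says that without loss of generality we may assume that $x \in \{-1, 1\}^d$, the full probability mass function $q$ can be written

$$q(z \mid x) = \Big(\frac{1}{2} + \frac{1}{2M}\Big)^{\frac{d}{2} + \frac{x^\top z}{2M}} \Big(\frac{1}{2} - \frac{1}{2M}\Big)^{\frac{d}{2} - \frac{x^\top z}{2M}}. \qquad (49)$$

Plugging the conditional (49) into the derivative of the Lagrangian results in

$$\frac{\partial}{\partial q(z)} \mathcal{L}(q, \mu, \lambda, \theta) = \Big(\frac{d}{2} + \frac{x^\top z}{2M}\Big) \log\Big(\frac{1}{2} + \frac{1}{2M}\Big) + \Big(\frac{d}{2} - \frac{x^\top z}{2M}\Big) \log\Big(\frac{1}{2} - \frac{1}{2M}\Big) + 1 - \lambda(z) + \theta + \mu^\top z.$$

Performing a few algebraic manipulations with the logarithmic terms, the final equality becomes

$$\frac{\partial}{\partial q(z)} \mathcal{L}(q, \mu, \lambda, \theta) = d \log\Big(\frac{\sqrt{(M + 1)(M - 1)}}{2M}\Big) + \frac{x^\top z}{2M} \log\Big(\frac{M + 1}{M - 1}\Big) + 1 - \lambda(z) + \theta + \mu^\top z.$$

The complementarity conditions for optimality 0] imply that $\lambda(z) = 0$, and since the equality constraints in the problem (48) are satisfied, we can choose $\theta$ and $\mu$ arbitrarily. Taking

$$\theta = -d \log\Big(\frac{\sqrt{(M + 1)(M - 1)}}{2M}\Big) - 1 \quad \text{and} \quad \mu = -\frac{1}{2M}\,x \log\Big(\frac{M + 1}{M - 1}\Big)$$

yields that the partial derivatives of $\mathcal{L}$ are 0, which shows that indeed our choice of $Q^*$ is optimal.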
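The stationarity argument for the $\ell_\infty$ case can be verified by brute force for small $d$. The sketch below is a hedged numerical check, not part of the proof. It assumes $x \in \{-1, 1\}^d$, $M > 1$, and $\lambda(z) = 0$, and it uses the multipliers $\theta = -1 - d \log\big(\sqrt{(M+1)(M-1)}/(2M)\big)$ and $\mu = -\tfrac{x}{2M}\log\tfrac{M+1}{M-1}$; the function names are hypothetical. It enumerates $\{-M, M\}^d$, evaluates $\log q(z) + 1 + \theta + \mu^\top z$ under the product distribution (47), and also recomputes the conditional mean.

```python
import itertools
import math


def product_pmf(z, x, big_m):
    """q(z | x) under (47): independent coordinates, P(Z_i = M) = 1/2 + x_i/(2M)."""
    prob = 1.0
    for z_i, x_i in zip(z, x):
        p_plus = 0.5 + x_i / (2.0 * big_m)
        prob *= p_plus if z_i > 0 else 1.0 - p_plus
    return prob


def stationarity_residuals(x, big_m):
    """Values of log q(z) + 1 + theta + mu^T z over all z in {-M, M}^d."""
    d = len(x)
    theta = -1.0 - d * math.log(math.sqrt((big_m + 1) * (big_m - 1)) / (2 * big_m))
    scale = math.log((big_m + 1) / (big_m - 1)) / (2 * big_m)
    mu = [-x_i * scale for x_i in x]
    residuals = []
    for z in itertools.product([-big_m, big_m], repeat=d):
        inner = sum(m_i * z_i for m_i, z_i in zip(mu, z))
        residuals.append(math.log(product_pmf(z, x, big_m)) + 1.0 + theta + inner)
    return residuals


def conditional_mean(x, big_m):
    """E[Z | X = x] under (47), computed by enumeration."""
    d = len(x)
    mean = [0.0] * d
    for z in itertools.product([-big_m, big_m], repeat=d):
        prob = product_pmf(z, x, big_m)
        for i in range(d):
            mean[i] += prob * z[i]
    return mean
```

All residuals vanishing (up to rounding) is exactly the statement that the partial derivatives of the Lagrangian are zero at the product distribution.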



D.3 Proof of Proposition [2] 

The proof follows along lines similar to the $\ell_\infty$ case: we compute the maximum entropy distribution subject to the constraint that $\mathbb{E}[Z] = x$ for some $x \in \mathbb{R}^d$ with $\|x\|_1 \le 1$, and $Z$ must be supported on the extreme points $\pm M e_i$ of the $\ell_1$-ball of radius $M$. (Recall that $e_i \in \mathbb{R}^d$ are the standard basis vectors.) Based on Theorem [4], in order to find the minimax mutual information, we need only consider the cases where $x = \pm e_i$ for some $i \in \{1, \ldots, d\}$.

Following this plan, we recall the entropy maximization problem (48), where now $x = \pm e_i$ and the sums are over $z \in M\{\pm e_i\}_{i=1}^d$. As in the proof of Proposition [1], we can write the Lagrangian and take its derivatives, finding that for $z = \pm M e_i$ we have

$$\frac{\partial}{\partial q(z)} \mathcal{L}(q, \mu, \lambda, \theta) = \log(q(z)) + 1 - \lambda(z) + \theta - \mu^\top z.$$

Solving for $q(z)$, we find that

$$q(z) = \exp(\lambda(z) - 1 - \theta)\exp(\mu^\top z),$$

but complementarity 0] guarantees that $\lambda(z) = 0$ since $q(z) > 0$, and normalizing we may write $q(z) = \exp(\mu^\top z) / \sum_{z'} \exp(\mu^\top z')$, where the sum is over the extreme points of the $\ell_1$-ball of radius $M$. In particular, after absorbing the factor $M$ into $\mu$, we have $q(Me_i) \propto e^{\mu_i}$ and $q(-Me_i) \propto e^{-\mu_i}$. Without loss of generality, let $x = e_i$. Symmetry suggests we take (and we verify this to be true)

$$q(z) \propto \begin{cases} \exp(\mu_i) & \text{if } z = Me_i \\ \exp(-\mu_i) & \text{if } z = -Me_i \\ \exp(0) & \text{otherwise.} \end{cases} \qquad (50)$$

Indeed, with the choice (50) of $q$, we have $q(Me_j) - q(-Me_j) = 0$ for $j \ne i$, while (setting $\gamma = \mu_i$ and normalizing appropriately)

$$q(Me_i) - q(-Me_i) = \frac{e^{\gamma}}{e^{\gamma} + e^{-\gamma} + 2(d - 1)} - \frac{e^{-\gamma}}{e^{\gamma} + e^{-\gamma} + 2(d - 1)}.$$

Thus, if we can solve the equation $Mq(Me_i) - Mq(-Me_i) = 1$, we will be nearly done. To that end, we write

$$\frac{e^{\gamma} - e^{-\gamma}}{e^{\gamma} + e^{-\gamma} + 2(d - 1)} = \frac{1}{M} \quad \text{or} \quad \beta - \beta^{-1} = \frac{1}{M}\big(\beta + \beta^{-1} + 2(d - 1)\big),$$

where we identified $\beta = e^{\gamma}$. Multiplying both sides by $\beta$, we have a quadratic equation in $\beta$:

$$\beta^2 - 1 = \frac{1}{M}\big(\beta^2 + 2\beta(d - 1) + 1\big) \quad \text{or} \quad (M - 1)\beta^2 - 2(d - 1)\beta - (M + 1) = 0,$$

whose solution is the positive root of

$$\beta = \frac{2d - 2 \pm \sqrt{(2d - 2)^2 + 4(M^2 - 1)}}{2(M - 1)}, \quad \text{or} \quad \gamma = \log\Big(\frac{2d - 2 + \sqrt{(2d - 2)^2 + 4(M^2 - 1)}}{2(M - 1)}\Big).$$

By our construction, with $\gamma$ so defined, we satisfy the constraints that $M[q(Me_i) - q(-Me_i)] = 1$ and $q(Me_j) - q(-Me_j) = 0$ for $j \ne i$. Since $q$ belongs to the exponential family and satisfies the constraints, it maximizes the entropy $H(Z)$ as desired 0].

Algebraic manipulations and the computation of the conditional entropy $H(Z \mid X = e_i)$ give the remainder of the statement of the proposition.
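The closed form for $\gamma$ is easy to sanity-check numerically. The following sketch is illustrative only and assumes $x = e_i$ and $M > 1$; the function names are hypothetical. It evaluates the positive root of the quadratic in $\beta = e^{\gamma}$, builds the three probability levels of the distribution (50), and lets one confirm the mean constraint $M[q(Me_i) - q(-Me_i)] = 1$ together with normalization over the $2d$ extreme points.

```python
import math


def gamma_closed_form(d, big_m):
    """Logarithm of the positive root of (M-1) b^2 - 2(d-1) b - (M+1) = 0."""
    root = math.sqrt((2 * d - 2) ** 2 + 4 * (big_m ** 2 - 1))
    return math.log((2 * d - 2 + root) / (2 * (big_m - 1)))


def l1_channel_levels(d, big_m):
    """Return (q(M e_i), q(-M e_i), q(any other extreme point)) for x = e_i."""
    gamma = gamma_closed_form(d, big_m)
    normalizer = math.exp(gamma) + math.exp(-gamma) + 2 * (d - 1)
    return (math.exp(gamma) / normalizer,
            math.exp(-gamma) / normalizer,
            1.0 / normalizer)


def conditional_entropy(d, big_m):
    """H(Z | X = e_i) in nats for the distribution (50)."""
    q_plus, q_minus, q_other = l1_channel_levels(d, big_m)
    masses = [q_plus, q_minus] + [q_other] * (2 * d - 2)
    return -sum(p * math.log(p) for p in masses)
```

With $m = 2d$ extreme points, `math.log(2 * d) - conditional_entropy(d, big_m)` then gives the value $\log(m) - H(Q^*(\cdot \mid X = u_i))$ from the bound (45).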



D.4 Proof of Proposition [3] 

The outline of the proof of Proposition [3] is as follows. First, recall from Lemma [13] that any distribution satisfying optimal local differential privacy must be supported on the extreme points of the outer set $D$ (as in the proof of Theorem [4]). Given this result, we reduce the problem of finding an optimally private distribution to a linear program, using symmetry arguments to simplify the LP. Finally, we show that the solution to the linear program is unique, which means that we have found the unique distribution satisfying optimal local differential privacy.

We begin by developing a reduction of the problem of finding a distribution with optimal local differential privacy to a linear program. Note that there is a non-increasing mapping between $M$ (the radius of the larger $\ell_\infty$ ball) and $\alpha^*$. Indeed, whenever $M$ increases, the set of distributions $Q$ from which to choose a privacy channel increases, so $\alpha^*$ decreases. Put inversely, for a given differential privacy level $\alpha$, we can find the smallest $M$ such that it is possible to construct an $\alpha$-differentially private channel $Q$ mapping from $[-1, 1]^d$ to $[-M, M]^d$. (Lemma [H] shows that the mapping from $M$ to $\alpha^*$ is implicitly invertible.)

We thus take the view of finding the smallest $M$ such that an $\alpha$-differentially private distribution exists. Fix $d \in \mathbb{N}$ and (with some abuse of notation) let $Z \in \{-1, 1\}^{d \times 2^d}$ be the matrix whose columns are the vertices of the hypercube $\{-1, 1\}^d$. For each $z, x \in \{-1, 1\}^d$, define the variables $q(z \mid x) \ge 0$ to represent the conditional probabilities, using $q(x) \in [0, 1]^{2^d}$ to represent the vector $[q(z \mid x)]_{z \in \{-1, 1\}^d}$. Then the linear program

$$\text{minimize} \quad -t \qquad (51)$$
$$\text{subject to} \quad Zq(x) - tx = 0 \text{ for all } x \in \{-1, 1\}^d,$$
$$q(z \mid x) \le e^{\alpha} q(z \mid x') \text{ for all } x, x', z \in \{-1, 1\}^d,$$
$$\sum_z q(z \mid x) = 1, \quad q(x) \ge 0 \text{ for all } x \in \{-1, 1\}^d,$$

at its solution $t^*$ (this solution is guaranteed to exist, since the vectors $q$ live in a compact set), yields the smallest value $M = 1/t^*$ for which it is possible to have an $\alpha$-differentially private channel $Q$. The solution vectors $q(x)$ give one such channel.

It is possible to calculate the solution of the LP (51) by hand, but it is tedious. We thus use the structure of optimal local differential privacy to reduce the problem to a single minimization problem over a vector $q \in \mathbb{R}^{2^d}$ (rather than a matrix $[q(x)] \in \mathbb{R}^{2^d \times 2^d}$). We have

Lemma 15. A distribution satisfying optimal local differential privacy must, for each $x \in \{-1, 1\}^d$, have $q(x) = \Pi(x)q$, where $\Pi(x) \in \{0, 1\}^{2^d \times 2^d}$ is a permutation matrix and $q$ is a fixed vector.

Proof Suppose for the sake of contradiction that this is not the case, but the vectors $q(x)$ and $t$ solve the linear program (51). Let $Q_1$ denote the matrix of the vectors $q(x)$. Choose vectors $q(x)$ and $q(x')$ such that $q(x) \ne \Pi q(x')$ for any permutation matrix $\Pi$. Now construct vectors $q_2(x)$ and $q_2(x')$ such that $q_2(z \mid x) = q(z' \mid x')$, where $z'$ is chosen so that $z_i x_i = z'_i x'_i$, and similarly choose $q_2$ so that $q_2(z \mid x') = q(z' \mid x)$, where $z_i x'_i = z'_i x_i$. Let $Q_2$ denote the matrix of vectors $q$, but where $q_2(x)$ and $q_2(x')$ replace $q(x), q(x')$. Then by construction, all the constraints of the original linear program (51) are satisfied. By symmetry and the strict convexity of the mutual information in the channel distribution $Q$, however, we see that

$$I(P, Q_1) = I(P, Q_2) = \frac{1}{2}\big(I(P, Q_1) + I(P, Q_2)\big) > I\Big(P, \frac{1}{2}(Q_1 + Q_2)\Big).$$

The decrease in mutual information gives the necessary contradiction. □



With Lemma [15] in hand, we can now turn to the smaller linear program, in a single vector $q$ and for a single vector $x \in \{-1, 1\}^d$, that will give us the locally optimal differentially private channel. Indeed, we consider the linear program in the variables $t \in \mathbb{R}$ and $q \in \mathbb{R}^{2^d}$, where $q$ is indexed by the column $z$ of $Z$:

$$\text{minimize} \quad -t \qquad (52)$$
$$\text{subject to} \quad Zq - tx = 0,$$
$$q(z) \le e^{\alpha} q(z') \text{ for all } z, z',$$
$$\sum_z q(z) = 1, \quad q \ge 0.$$

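Define the constants

$$C_d := \big|\{z \in \{-1, 1\}^d : z^\top x > 0\}\big| = \sum_{i=0}^{\lceil d/2 \rceil - 1} \binom{d}{i} \quad \text{and} \quad K_d := \sum_{z : z^\top x > 0} z^\top x = \sum_{i=0}^{\lceil d/2 \rceil - 1} \binom{d}{i}(d - 2i).$$

We have the following lemma, which characterizes the structure of the solution vector $q$.

Lemma 16. Define $\alpha^* = \log\frac{K_d + 2^d - C_d}{K_d - C_d}$. For $\alpha < \alpha^*$, the unique solution to the linear program (52) is given by

$$q(z) \propto \begin{cases} \exp(\alpha) & \text{if } z^\top x > 0 \\ 1 & \text{otherwise.} \end{cases}$$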

Proof First, problem (52) is clearly equivalent to the linear program

$$\text{minimize} \quad -t \qquad (53)$$
$$\text{subject to} \quad Zq - tx = 0,$$
$$\max_z\{q(z)\} + e^{\alpha} \max_z\{-q(z)\} \le 0,$$
$$\sum_z q(z) = 1, \quad q \ge 0.$$

Our proof proceeds in two large steps: first, we argue that a $q$ of the form specified in the lemma is indeed the solution to the problem (53), then we use results on uniqueness of solutions to linear programs due to Mangasarian [32].

For the first step, we begin by writing the Lagrangian to the problem (53). We introduce dual variables $\theta \in \mathbb{R}^d$ for the constraint $Zq - tx = 0$, $\lambda \ge 0$ for the first inequality, $\tau \in \mathbb{R}$ for the sum constraint, and $\beta \in \mathbb{R}^{2^d}_+$ for the non-negativity of $q$. With this, we have Lagrangian

$$\mathcal{L}(q, t, \theta, \lambda, \tau, \beta) = -t + \theta^\top(Zq - tx) + \lambda\Big[\max_z\{q(z)\} + e^{\alpha} \max_z\{-q(z)\}\Big] + \tau(\mathbf{1}^\top q - 1) - \beta^\top q. \qquad (54)$$

Recall the generalized subgradient KKT conditions for optimality of the solution to an optimization problem [2J, Chapter VII]. A vector $q > 0$ is optimal for the problem (53) if the constraints $\max_j\{q_j\} \le e^{\alpha} \min_j\{q_j\}$ and $\sum_j q_j = 1$ hold, there is a $t \ge 0$ such that $Zq - tx = 0$, and we can find $\theta$, $\lambda$, and $\tau$ such that

$$Z^\top \theta + \lambda[v_+ - e^{\alpha} v_-] + \tau \mathbf{1} = 0, \quad \beta = 0, \quad \text{and} \quad \theta^\top x = -1, \qquad (55)$$

where $v_+$ and $v_-$ are vectors satisfying

$$v_+ \in \mathrm{Conv}\big\{e_i : q_i = \max_j\{q_j\}\big\} \quad \text{and} \quad v_- \in \mathrm{Conv}\big\{e_i : q_i = \min_j\{q_j\}\big\}.$$

That $\beta = 0$ follows by complementarity (recall that $q > 0$ is assumed).

If we can find settings for the vectors $\theta, \lambda, \tau$, and $v_{\pm}$ satisfying the KKT conditions (55), we are done. To that end, set $\theta = -x/d$. Then by inspection $\theta^\top x = -\|x\|_2^2/d = -1$, and we can rewrite the remaining KKT condition by noting that we must find vectors $v_+$, $v_-$, and $\tau \in \mathbb{R}$ such that

$$-\frac{1}{d} Z^\top x + v_+ - e^{\alpha} v_- + \tau \mathbf{1} = 0, \quad v_+^\top \mathbf{1} = v_-^\top \mathbf{1}, \quad v_+ \ge 0, \quad v_- \ge 0,$$
$$v_+(z) = 0 \text{ if } q(z) < \max_z\{q(z)\}, \quad \text{and} \quad v_-(z) = 0 \text{ if } q(z) > \min_z\{q(z)\}.$$

Note that we have eliminated $\lambda$ as it is a non-negative homogeneous scaling term on $v_+$ and $v_-$. Now we assume that we choose the two values $q_+, q_-$ with $0 < q_- < q_+$ such that $q(z) = q_+$ when $z^\top x > 0$ and $q(z) = q_-$ when $z^\top x \le 0$. (Clearly such values can be chosen such that $\sum_z q(z) = 1$.) We will choose the values of $v_+$, $v_-$, and $\tau$ satisfying the KKT conditions. Indeed, set

$$v_+(z) = \begin{cases} \frac{z^\top x}{d} - \tau & \text{if } z^\top x > 0 \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad v_-(z) = \begin{cases} -e^{-\alpha}\frac{z^\top x}{d} + e^{-\alpha}\tau & \text{if } z^\top x \le 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (56)$$

By inspection, we see that $-Z^\top x/d + v_+ - e^{\alpha} v_- + \tau \mathbf{1} = 0$, so the only question remaining is whether we can choose $\tau$ such that $v_{\pm} \ge 0$ and $v_+^\top \mathbf{1} = v_-^\top \mathbf{1}$.

To that end, we recall the definition of the constants $C_d$ and $K_d$, and we seek $\tau$ such that

$$\sum_z v_+(z) = \frac{1}{d} K_d - \tau C_d = e^{-\alpha}\frac{1}{d} K_d + e^{-\alpha}\tau(2^d - C_d) = \sum_z v_-(z)$$

by the symmetry in the sums. Rewriting the equation, we find that for equality we must have

$$\tau\big(e^{-\alpha}(2^d - C_d) + C_d\big) = \frac{1}{d} K_d (1 - e^{-\alpha}) \quad \text{or} \quad \tau = \frac{K_d}{d} \cdot \frac{1 - e^{-\alpha}}{e^{-\alpha}(2^d - C_d) + C_d} = \frac{K_d}{dC_d} \cdot \frac{e^{\alpha} - 1}{e^{\alpha} + 2^d/C_d - 1}.$$

Thus we find that if $\alpha$ is such that

$$\frac{K_d}{dC_d} \cdot \frac{e^{\alpha} - 1}{e^{\alpha} + 2^d/C_d - 1} < \frac{1}{d}, \qquad (57)$$

then by our choice (56) of the vectors $v_+$ and $v_-$, we have $v_+(z) > 0$ whenever $z^\top x > 0$, and $v_-(z) > 0$ whenever $z^\top x \le 0$. Noting that by our setting of $q(z)$, we have by symmetry of $Z$ that there exists a $t \ge 0$ such that $Zq = tx$, we find that our choice of $q$ is optimal.

We have two arguments remaining in the proof. The first is to show that for $\alpha < \alpha^*$ defined in the statement of the lemma, the inequality (57) holds. Rewriting the inequality, we solve

$$e^{\alpha} - 1 = \frac{C_d}{K_d}\Big(e^{\alpha} + 2^d/C_d - 1\Big) \quad \text{or} \quad e^{\alpha}\Big(1 - \frac{C_d}{K_d}\Big) = \frac{2^d - C_d}{K_d} + 1, \quad \text{i.e.} \quad \alpha = \log\frac{K_d + 2^d - C_d}{K_d - C_d}.$$

For any $\alpha < \alpha^*$, the strict inequality (57) holds, so the setting (56) of $v_+$ and $v_-$ satisfies the KKT conditions.

Our last argument regards the uniqueness of the two-valued solution vector $q$. For that, we apply Mangasarian's result [32, Theorem 1] that if there exists an $\epsilon > 0$ such that for any vector $u \in \mathbb{R}^{2^d}$ with $\|u\|_2 = 1$, $q$ is a solution of the linear program (52) when the objective is $-t + \epsilon u^\top q$, then $q$ is unique. Luckily, this is not difficult given our previous work. The Lagrangian (54) for the modified linear program becomes

$$\epsilon u^\top q - t + \theta^\top(Zq - tx) + \lambda\Big[\max_z\{q(z)\} + e^{\alpha} \max_z\{-q(z)\}\Big] + \tau(\mathbf{1}^\top q - 1) - \beta^\top q.$$

The only modification in our KKT conditions (55) is that the first equality becomes

$$\epsilon u + Z^\top \theta + \lambda[v_+ - e^{\alpha} v_-] + \tau \mathbf{1} = 0.$$

By the strictness of the inequalities $v_+(z) > 0$ for $z$ such that $z^\top x > 0$ (and similarly for $v_-$) in the definitions (56) whenever $\alpha < \alpha^*$, we see that for suitably small $\epsilon > 0$, the vectors $v_+$ and $v_-$ can be perturbed so that the KKT conditions are still satisfied. This proves the uniqueness of the two-valued solution vector $q$. □
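The two-level structure used in this proof can be checked by enumeration for small $d$. The sketch below is a hedged illustration, not the authors' computation. It assumes that $C_d$ counts the vertices $z$ with $z^\top x > 0$ and that $K_d$ sums $z^\top x$ over those vertices, which is how the two constants enter the identity $\sum_z v_+(z) = K_d/d - \tau C_d$; all function names are hypothetical. It builds $q(z) \propto e^{\alpha}$ for $z^\top x > 0$ and $q(z) \propto 1$ otherwise, computes $Zq$, and compares the resulting $t$ with the value of $\tau$ derived above.

```python
import itertools
import math


def inner(z, x):
    return sum(a * b for a, b in zip(z, x))


def two_level_channel(x, alpha):
    """Vertices of {-1, 1}^d and q(z) with two levels in ratio exp(alpha)."""
    vertices = list(itertools.product([-1, 1], repeat=len(x)))
    weights = [math.exp(alpha) if inner(z, x) > 0 else 1.0 for z in vertices]
    total = sum(weights)
    return vertices, [w / total for w in weights]


def constants(d):
    """Assumed definitions: C_d = #{z : <z, x> > 0}, K_d = sum of <z, x> over them."""
    # i counts the coordinates where z disagrees with x, so <z, x> = d - 2i.
    c_d = sum(math.comb(d, i) for i in range(d + 1) if d - 2 * i > 0)
    k_d = sum(math.comb(d, i) * (d - 2 * i) for i in range(d + 1) if d - 2 * i > 0)
    return c_d, k_d


def predicted_t(d, alpha):
    """tau = (K_d / d) (e^alpha - 1) / (2^d - C_d + e^alpha C_d)."""
    c_d, k_d = constants(d)
    return (k_d / d) * (math.exp(alpha) - 1) / (2 ** d - c_d + math.exp(alpha) * c_d)


def alpha_star(d):
    """log((K_d + 2^d - C_d) / (K_d - C_d)); requires d >= 2 so K_d > C_d."""
    c_d, k_d = constants(d)
    return math.log((k_d + 2 ** d - c_d) / (k_d - c_d))


def channel_mean(vertices, q):
    """The vector Z q = sum_z z q(z)."""
    d = len(vertices[0])
    return [sum(p * z[i] for z, p in zip(vertices, q)) for i in range(d)]
```

Under these assumptions the channel mean equals $t\,x$ with $t$ equal to `predicted_t(d, alpha)`, so the corresponding radius is $M = 1/t$, and condition (57) reads $t < 1/d$.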



Remarks: Following an argument with completely the same structure as the proof, we see that for any $d \in \mathbb{N}$ (say $d > 3$), there are different "regimes" of $\alpha$; that is, there exists a sequence $\alpha_0, \ldots, \alpha_{d-1}$ (or $\alpha_{d-2}$ for $d$ even) such that for $\alpha \in (\alpha_{2i}, \alpha_{2i+2})$, the unique optimal solution to the linear program (52) is given by taking

$$q(z) \propto \begin{cases} \exp(\alpha) & \text{for } z \text{ s.t. } \langle z, x \rangle > 2(i + 1) \\ 1 & \text{for } z \text{ s.t. } \langle z, x \rangle \le 2(i + 1) \end{cases}$$

(for $\alpha < \alpha_0$, we say $i = -1$ above). For $\alpha = \alpha_{2i}$, the set of solutions is given by the convex combinations of the solution vectors

$$q_<(z) \propto \begin{cases} \exp(\alpha) & \text{for } z \text{ s.t. } \langle z, x \rangle > 2i \\ 1 & \text{for } z \text{ s.t. } \langle z, x \rangle \le 2i \end{cases} \quad \text{and} \quad q_>(z) \propto \begin{cases} \exp(\alpha) & \text{for } z \text{ s.t. } \langle z, x \rangle > 2(i + 1) \\ 1 & \text{for } z \text{ s.t. } \langle z, x \rangle \le 2(i + 1), \end{cases}$$

which follows from arguments similar to our application of Mangasarian's results [32]. (Recall also the argument with convex combinations of ratios following Proposition [3].)

Now we may complete the proof of Proposition [3]. Indeed, we see from Lemma [16] that the distribution satisfying optimal local differential privacy must assign probability masses at two levels, at least when the point being perturbed comes from $\{-1, 1\}^d$. Now let $Q$ be a distribution specified in the lemma. An argument identical to that in our proof of Proposition [1], by symmetry, shows that the distribution $P$ maximizing the mutual information $I(P, Q)$ is uniform on $\{-1, 1\}^d$. The uniqueness of $Q$ then follows from Lemmas [15] and [16], which show that such $Q$ is the only distribution that minimizes the radius $M$ of the ball $[-M, M]^d$; inverting this bound gives the proposition. □

E Proof of Corollary [4]

First, we claim that as $\gamma \to 0$, the following expansion holds:

$$\log(2d) - \log(e^{\gamma} + e^{-\gamma} + 2d - 2) + \gamma\,\frac{e^{\gamma} - e^{-\gamma}}{e^{\gamma} + e^{-\gamma} + 2d - 2} = \frac{\gamma^2}{2d} + O\Big(\frac{\gamma^4}{d}\Big). \qquad (58)$$

Before proving this, we use the expansion (58) to prove Corollary [4]. Noting that

$$\frac{2d - 2 + \sqrt{(2d - 2)^2 + 4(M^2 - 1)}}{2(M - 1)} = \sqrt{\frac{M + 1}{M - 1}} + \frac{d - 1}{M - 1} + \Theta(d^2/M^2),$$

we see that since $\log(1 + x) = x - \frac{x^2}{2} + O(x^3)$, we have $\gamma = \frac{d}{M} + \Theta\big(\frac{d^2}{M^2}\big)$. Thus the mutual information in Proposition [2] is



j^p. ^,^^ log\V{M + l)/{M-l) + d/M + QidyM')) ^ ^(\og\l + d/M)\ 
— + B mm ' 



j^ log^(l + d/M) ^ 



2M2 \ I M4 ' d 



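Now we return to showing the claim (58). Indeed, define $f(\gamma) = \log(e^{\gamma} + e^{-\gamma} + 2d - 2)$. Taking several derivatives, we have

$$f^{(1)}(\gamma) = \frac{e^{\gamma} - e^{-\gamma}}{e^{\gamma} + e^{-\gamma} + 2d - 2}, \qquad f^{(2)}(\gamma) = \frac{(e^{\gamma} + e^{-\gamma})(2d - 2) + 4}{(e^{\gamma} + e^{-\gamma} + 2d - 2)^2},$$

and

$$f^{(3)}(\gamma) = \frac{-(e^{2\gamma} - e^{-2\gamma})(2d - 2) - 8(e^{\gamma} - e^{-\gamma}) + (2d - 2)^2(e^{\gamma} - e^{-\gamma})}{(e^{\gamma} + e^{-\gamma} + 2d - 2)^3}.$$

Via a Taylor expansion, we have

$$\log(2d) = \log(e^{\gamma} + e^{-\gamma} + 2d - 2) + (0 - \gamma) f^{(1)}(\gamma) + \frac{\gamma^2}{2} f^{(2)}(\gamma) + O\big(f^{(3)}(\gamma)\gamma^3\big).$$

Recalling our calculation of the first derivative $f^{(1)}(\gamma)$, we thus see that

$$\log(2d) - \log(e^{\gamma} + e^{-\gamma} + 2d - 2) + \gamma\,\frac{e^{\gamma} - e^{-\gamma}}{e^{\gamma} + e^{-\gamma} + 2d - 2} = \frac{(e^{\gamma} + e^{-\gamma})(2d - 2) + 4}{(e^{\gamma} + e^{-\gamma} + 2d - 2)^2} \cdot \frac{\gamma^2}{2} + O\big(f^{(3)}(\gamma)\gamma^3\big).$$

A few simpler Taylor expansions yield that $f^{(3)}(\gamma) = O(\gamma/d)$, which means that all we have left to tackle is $f^{(2)}(\gamma)$. But noting that

$$2(e^{\gamma} + e^{-\gamma}) = 4\Big(1 + \frac{\gamma^2}{2!} + \frac{\gamma^4}{4!} + \cdots\Big) = 4 + O(\gamma^2)$$

implies that $f^{(2)}(\gamma)\gamma^2 = \frac{\gamma^2}{d} + O(\gamma^4/d)$, which yields the result. □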
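The small-$\gamma$ expansion can also be checked numerically. The sketch below is an illustration under stated assumptions: it evaluates $\log(2d) - f(\gamma) + \gamma f^{(1)}(\gamma)$ with $f(\gamma) = \log(e^{\gamma} + e^{-\gamma} + 2d - 2)$ and compares it with the leading term $\gamma^2/(2d)$, which comes from $f^{(2)}(0) = 1/d$. The function names and the test grid are hypothetical, and the comparison only probes the order of the remainder, not its exact constant.

```python
import math


def information_gap(gamma, d):
    """log(2d) - f(gamma) + gamma * f'(gamma), f = log(e^g + e^-g + 2d - 2)."""
    denom = math.exp(gamma) + math.exp(-gamma) + 2 * d - 2
    slope = (math.exp(gamma) - math.exp(-gamma)) / denom
    return math.log(2 * d) - math.log(denom) + gamma * slope


def leading_term(gamma, d):
    """Second-order term gamma^2 / (2d) of the expansion."""
    return gamma ** 2 / (2 * d)


def remainder(gamma, d):
    """Difference between the exact expression and its leading term."""
    return information_gap(gamma, d) - leading_term(gamma, d)
```

On a grid of small $\gamma$, the remainder stays below $\gamma^4/d$ in absolute value, consistent with a fourth-order error term.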